<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Lollsolo</id>
	<title>Obsolete Lustre Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Lollsolo"/>
	<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Special:Contributions/Lollsolo"/>
	<updated>2026-05-08T05:01:22Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.7</generator>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4305</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4305"/>
		<updated>2008-03-13T07:25:02Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 28 MONDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM- 9:30AM  Welcome &amp;amp; Introduction &lt;br /&gt;
&lt;br /&gt;
- Peter Bojanic, Director, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM- 10AM  Lustre Business Update&lt;br /&gt;
&lt;br /&gt;
- Kevin Canady, Director of Business Development, Sun&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  Lustre Engineering Update&lt;br /&gt;
&lt;br /&gt;
- Eric Barton, Lustre Lead Engineer, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 12PM  Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
&lt;br /&gt;
- Peter Braam, VP of Lustre, Sun&lt;br /&gt;
&lt;br /&gt;
12PM- 1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM- 2PM  1.8 Features &amp;amp; Benefits&lt;br /&gt;
&lt;br /&gt;
- Bryon Neitzel, Lustre, Sun&lt;br /&gt;
&lt;br /&gt;
2PM- 2:30PM  Lustre at LLNL&lt;br /&gt;
&lt;br /&gt;
- Mark Gary, LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM- 3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM- 3:30PM  A Global File System with Lustre &amp;amp; LNET Routers&lt;br /&gt;
&lt;br /&gt;
- Shane Canon, ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM- 4PM  DARPA HPCS Project&lt;br /&gt;
&lt;br /&gt;
- John Carrier, Cray&lt;br /&gt;
&lt;br /&gt;
4PM- 4:30PM  ILM- Lustre HSM&lt;br /&gt;
&lt;br /&gt;
- Aurelien Degremont, CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM- 5PM&lt;br /&gt;
&lt;br /&gt;
6PM- 7PM  EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM- 9:30AM  Keynote Speaker&lt;br /&gt;
&lt;br /&gt;
- TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM- 10AM  Lustre Partner: Terascala&lt;br /&gt;
&lt;br /&gt;
- Rick Friedman, VP of Marketing and Product, Terascala&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  HPC Software Stack for Linux&lt;br /&gt;
&lt;br /&gt;
- Makia Minich, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM- 11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 12PM  BOF on Lustre over WAN&lt;br /&gt;
&lt;br /&gt;
- Eric Barton, ORNL, TACC, IU&lt;br /&gt;
&lt;br /&gt;
12PM- 1PM  LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM- 1:30PM  Lustre Tools Session-Backup &amp;amp; Quota&lt;br /&gt;
&lt;br /&gt;
- Nicholas P. Cardo, LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM- 2PM  Lustre Tools Session-Shine, Administration Tool&lt;br /&gt;
&lt;br /&gt;
- Stephane Thiell, CEA&lt;br /&gt;
&lt;br /&gt;
2PM- 2:30PM  Guest Talk&lt;br /&gt;
&lt;br /&gt;
- TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM- 3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM- 3:30PM  Lustre User Space Servers with ZFS&lt;br /&gt;
&lt;br /&gt;
- Ricardo Correia, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM- 4PM  NFS/pNFS export&lt;br /&gt;
&lt;br /&gt;
- Oleg Drokin, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
4PM- 4:30PM  Scientific Application Performance with Lustre&lt;br /&gt;
&lt;br /&gt;
- Tom Wang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM- 5PM  LNet Selftest&lt;br /&gt;
&lt;br /&gt;
- Isaac Huang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM- 10AM  Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  Clustered Metadata Servers&lt;br /&gt;
&lt;br /&gt;
- Hua Huang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM- 11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 11:30AM  Metadata Performance with Size on MDS&lt;br /&gt;
&lt;br /&gt;
- Vitaly Fertman, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM- 12PM  Discussion Panel&lt;br /&gt;
&lt;br /&gt;
- Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12PM- 1PM  WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4304</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4304"/>
		<updated>2008-03-13T07:24:24Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM- 9:30AM  Welcome &amp;amp; Introduction &lt;br /&gt;
&lt;br /&gt;
- Peter Bojanic,Director,Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM- 10AM  Lustre Business Update&lt;br /&gt;
&lt;br /&gt;
- Kevin Canady, Director of Business Development, Sun&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  Lustre Engineering Update&lt;br /&gt;
&lt;br /&gt;
- Eric Barton, Lustre Lead Engineer, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 12AM  Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
&lt;br /&gt;
- Peter Braam ,VP of Lustre, Sun&lt;br /&gt;
&lt;br /&gt;
12AM- 1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM- 2PM  1.8 Features &amp;amp; Benefits&lt;br /&gt;
&lt;br /&gt;
- Bryon Neitzel, Lustre, Sun&lt;br /&gt;
&lt;br /&gt;
2PM- 2:30PM  Lustre at LLNL&lt;br /&gt;
&lt;br /&gt;
- Mark Gary, LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM- 3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM- 3:30PM  A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
&lt;br /&gt;
- Shane Canon, ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM- 4PM  DARPA HPCS Project&lt;br /&gt;
&lt;br /&gt;
- John Carrier, Cray&lt;br /&gt;
&lt;br /&gt;
4PM- 4:30PM  ILM- Lustre HSM&lt;br /&gt;
&lt;br /&gt;
- Aurelien Degremont, CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM- 5PM&lt;br /&gt;
&lt;br /&gt;
6PM- 7PM  EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM- 9:30AM  Keynote Speaker&lt;br /&gt;
&lt;br /&gt;
- TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM- 10AM  Lustre Partner: Terascala&lt;br /&gt;
&lt;br /&gt;
- Rick Friedman, VP of Marketing and Product, Terascala&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  HPC Software Stack for Linux&lt;br /&gt;
&lt;br /&gt;
- Makia Minich, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM- 11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 12AM  BOF on Lustre over WAN&lt;br /&gt;
&lt;br /&gt;
- Eric Barton, ORNL, TACC, IU&lt;br /&gt;
&lt;br /&gt;
12AM- 1PM  LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM- 1:30PM  Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
&lt;br /&gt;
- Nicholas P. Cardo, LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM- 2PM  Lustre Tools Session-Shine, Administration Tool&lt;br /&gt;
&lt;br /&gt;
- Stephane Thiell, CEA&lt;br /&gt;
&lt;br /&gt;
2PM- 2:30PM  Guest Talk&lt;br /&gt;
&lt;br /&gt;
- TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM- 3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM- 3:30PM  Lustre User Space Servers with ZFS&lt;br /&gt;
&lt;br /&gt;
- Ricardo Correia, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM- 4PM  NFS/pNFS export&lt;br /&gt;
&lt;br /&gt;
- Oleg Drokin, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
4PM- 4:30PM  Scientific Application Performance with Lustre&lt;br /&gt;
&lt;br /&gt;
-Tom Wang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM- 5PM  LNet Selftest&lt;br /&gt;
&lt;br /&gt;
- Isaac Huang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM- 10AM  Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM- 10:30AM  Clustered Metadata Severs&lt;br /&gt;
&lt;br /&gt;
- Hua Huang, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM- 11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM- 11:30AM  Metadata Performance with Size on MDS&lt;br /&gt;
&lt;br /&gt;
- Vitaly Fertman, Lustre Group, Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM- 12AM  Discussion Panel&lt;br /&gt;
&lt;br /&gt;
- Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM- 1PM  WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4303</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4303"/>
		<updated>2008-03-13T07:04:25Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 28 MONDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM  Welcome &amp;amp; Introduction &lt;br /&gt;
&lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Business Update&lt;br /&gt;
&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Lustre Engineering Update&lt;br /&gt;
&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM  1.8 Features &amp;amp; Benefits&lt;br /&gt;
&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Lustre at LLNL&lt;br /&gt;
&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  DARPA HPCS Project&lt;br /&gt;
&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  ILM-Lustre HSM&lt;br /&gt;
&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM  EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM  Keynote Speaker&lt;br /&gt;
&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Partner:Terascala&lt;br /&gt;
&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  HPC Software Stack for Linux&lt;br /&gt;
&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  BOF on Lustre over WAN&lt;br /&gt;
&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM  Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM  Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Guest Talk&lt;br /&gt;
&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  Lustre User Space Servers with ZFS&lt;br /&gt;
&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  NFS/pNFS export&lt;br /&gt;
&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  Scientific Application Performance with Lustre&lt;br /&gt;
&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM  LNet Selftest&lt;br /&gt;
&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM  Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Clustered Metadata Severs&lt;br /&gt;
&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM  Metadata Performance with Size on MDS&lt;br /&gt;
&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM  Discussion Panel&lt;br /&gt;
&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4302</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4302"/>
		<updated>2008-03-13T07:02:38Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM  Welcome &amp;amp; Introduction &lt;br /&gt;
&lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Business Update&lt;br /&gt;
&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Lustre Engineering Update&lt;br /&gt;
&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM  1.8 Features &amp;amp; Benefits&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Lustre at LLNL&lt;br /&gt;
&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  ILM-Lustre HSM&lt;br /&gt;
&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM  EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM  Keynote Speaker&lt;br /&gt;
&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Partner:Terascala&lt;br /&gt;
&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  HPC Software Stack for Linux&lt;br /&gt;
&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  BOF on Lustre over WAN&lt;br /&gt;
&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM  Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM  Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Guest Talk&lt;br /&gt;
&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  Lustre User Space Servers with ZFS&lt;br /&gt;
&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  NFS/pNFS export&lt;br /&gt;
&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  Scientific Application Performance with Lustre&lt;br /&gt;
&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM  LNet Selftest&lt;br /&gt;
&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM  Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Clustered Metadata Severs&lt;br /&gt;
&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM  Metadata Performance with Size on MDS&lt;br /&gt;
&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM  Discussion Panel&lt;br /&gt;
&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4301</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4301"/>
		<updated>2008-03-13T06:55:31Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM  Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM  1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Lustre at LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM  EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM  Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM  Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM  BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM  Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM  Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM  Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM  Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM  NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM  Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM  LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM  Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM  Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM  COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM  Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM  Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM  WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4300</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4300"/>
		<updated>2008-03-13T06:52:46Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 30 WEDNESDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre at LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA BOF on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4299</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4299"/>
		<updated>2008-03-13T06:52:11Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 29 TUESDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre at LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4298</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4298"/>
		<updated>2008-03-13T06:49:53Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 29 TUESDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre at LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backup&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4297</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4297"/>
		<updated>2008-03-13T06:46:31Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 28 MONDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre at LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4296</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4296"/>
		<updated>2008-03-13T06:45:46Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 28 MONDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8 Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4295</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4295"/>
		<updated>2008-03-13T06:43:55Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 29 TUESDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4294</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4294"/>
		<updated>2008-03-13T06:43:41Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 28 MONDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30AM-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4293</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4293"/>
		<updated>2008-03-13T06:41:35Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* APRIL 29 TUESDAY */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM LNet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4292</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4292"/>
		<updated>2008-03-13T06:40:40Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM Lnet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
===APRIL 30 WEDNESDAY===&lt;br /&gt;
9AM-10AM Joint CEA/NNSA Bof on PetaScale I/O Issues&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Clustered Metadata Severs&lt;br /&gt;
-Hua Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-11:30AM Metadata Performance with Size on MDS&lt;br /&gt;
-Vitaly Fertman,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
11:30AM-12AM Discussion Panel&lt;br /&gt;
-Core Lustre Engineers&lt;br /&gt;
&lt;br /&gt;
12AM-1PM WRAP UP&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4291</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4291"/>
		<updated>2008-03-13T06:29:52Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM Lustre Engineering Update&lt;br /&gt;
-Eric Barton,Lustre Lead Engineer,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM Lustre Architecture &amp;amp; Roadmap&lt;br /&gt;
-Peter Braam,VP of Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-2PM 1.8Features &amp;amp; Benefits&lt;br /&gt;
-Bryon Neitzel,Lustre,Sun&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Lustre ay LLNL&lt;br /&gt;
-Mark Gary,LLNL&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM A Global File System with Lustre &amp;amp;LNET Routers&lt;br /&gt;
-Shane Canon,ORNL&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM DARPA HPCS Project&lt;br /&gt;
-John Carrier,Cray&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM ILM-Lustre HSM&lt;br /&gt;
-Aurelien Degremont,CEA&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;br /&gt;
&lt;br /&gt;
===APRIL 29 TUESDAY===&lt;br /&gt;
9 AM-9:30AM Keynote Speaker&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Partner:Terascala&lt;br /&gt;
-Rick Friedman,VP of Marketing and Product,Terascala&lt;br /&gt;
&lt;br /&gt;
10AM-10:30AM HPC Software Stack for Linux&lt;br /&gt;
-Makia Minich,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
10:30AM-11AM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
11AM-12AM BOF on Lustre over WAN&lt;br /&gt;
-Eric Barton,ORNL,TACC,IU&lt;br /&gt;
&lt;br /&gt;
12AM-1PM LUNCH&lt;br /&gt;
&lt;br /&gt;
1PM-1:30PM Lustre Tools Session-Backuo&amp;amp;Quota&lt;br /&gt;
-Nicholas P.Cardo,LBNL&lt;br /&gt;
&lt;br /&gt;
1:30PM-2PM Lustre Tools Session-Shine,Administration Tool&lt;br /&gt;
-Stephane Thiell,CEA&lt;br /&gt;
&lt;br /&gt;
2PM-2:30PM Guest Talk&lt;br /&gt;
-TBD&lt;br /&gt;
&lt;br /&gt;
2:30PM-3PM COFFEE BREAK&lt;br /&gt;
&lt;br /&gt;
3PM-3:30PM Lustre User Space Servers with ZFS&lt;br /&gt;
-Ricardo Correia,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
3:30PM-4PM NFS/pNFS export&lt;br /&gt;
-Oleg Drokin,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4PM-4:30PM Scientific Application Performance with Lustre&lt;br /&gt;
-Tom Wang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
4:30PM-5PM Lnet Selftest&lt;br /&gt;
-Isaac Huang,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
6PM-7PM EVENING RECEPTION&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4290</id>
		<title>Lustre User Group 2008</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_User_Group_2008&amp;diff=4290"/>
		<updated>2008-03-13T05:44:28Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AGENDA==&lt;br /&gt;
===APRIL 28 MONDAY===&lt;br /&gt;
9 AM-9:30AM Welcome &amp;amp; Introduction &lt;br /&gt;
-Peter Bojanic,Director,Lustre Group,Sun&lt;br /&gt;
&lt;br /&gt;
9:30-10AM Lustre Business Update&lt;br /&gt;
-Kevin Canady,Director of Business Development,Sun&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4266</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4266"/>
		<updated>2008-02-18T05:53:28Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation at the gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre 1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June, 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June,2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** Powerpoint slides of an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL on CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:]This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan,John Shalf (NERSC) on CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPix site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory event at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May,2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
** Glenn Newell, Sr.IT Solutions Mgr,&lt;br /&gt;
** Naji Bekhazi,Director Of R&amp;amp;D,Mask Data Prep (CATS)&lt;br /&gt;
** Ray Morgan,Sr.Product Marketing Manager,Mask Data Prep(CATS)&lt;br /&gt;
** 2007&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf paper in pdf format ]&lt;br /&gt;
&lt;br /&gt;
==TeraGrid==&lt;br /&gt;
*&#039;&#039;&#039;Wide Area Filesystem Performance Using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
**TeraGrid 2007 Conference, Madison, WI&lt;br /&gt;
**[http://wiki.lustre.org/index.php?title=Image:Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution.(2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have been addressed&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site](It&#039;s the same as the attachment of LCI paper above.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies(May 2006)&lt;br /&gt;
**[http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4265</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4265"/>
		<updated>2008-02-18T05:51:55Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* TeraGrid */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation on gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June,2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** Powerpoint slides of an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL on CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:]This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan,John Shalf (NERSC) on CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPix site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory event at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May,2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
** Glenn Newell, Sr.IT Solutions Mgr,&lt;br /&gt;
** Naji Bekhazi,Director Of R&amp;amp;D,Mask Data Prep (CATS)&lt;br /&gt;
** Ray Morgan,Sr.Product Marketing Manager,Mask Data Prep(CATS)&lt;br /&gt;
** 2007&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf paper in pdf format ]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution.(2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have been addressed&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site](It&#039;s the same as the attachment of LCI paper above.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies(May 2006)&lt;br /&gt;
**[http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
==TeraGrid==&lt;br /&gt;
*&#039;&#039;&#039;Wide Area Filesystem Performance Using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
**TeraGrid 2007 Conference, Madison, WI&lt;br /&gt;
**[http://wiki.lustre.org/index.php?title=Image:Lustre_wan_tg07.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4264</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4264"/>
		<updated>2008-02-18T04:22:13Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* TeraGrid */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation on gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June,2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** Powerpoint slides of an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL on CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:]This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan,John Shalf (NERSC) on CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPix site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory event at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May,2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
** Glenn Newell, Sr.IT Solutions Mgr,&lt;br /&gt;
** Naji Bekhazi,Director Of R&amp;amp;D,Mask Data Prep (CATS)&lt;br /&gt;
** Ray Morgan,Sr.Product Marketing Manager,Mask Data Prep(CATS)&lt;br /&gt;
** 2007&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf paper in pdf format ]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution.(2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have been addressed&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site](It&#039;s the same as the attachment of LCI paper above.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies(May 2006)&lt;br /&gt;
**[http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
==TeraGrid==&lt;br /&gt;
*&#039;&#039;&#039;Wide Area Filesystem Performance Using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
**TeraGrid 2007 Conference, Madison, WI&lt;br /&gt;
**[http://wiki.lustre.org/index.php?title=Image:Lustre_wan_tg07.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4263</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4263"/>
		<updated>2008-02-18T04:19:22Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation on gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June,2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** Powerpoint slides of an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL on CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:]This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan,John Shalf (NERSC) on CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPix site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory event at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May,2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
** Glenn Newell, Sr.IT Solutions Mgr,&lt;br /&gt;
** Naji Bekhazi,Director Of R&amp;amp;D,Mask Data Prep (CATS)&lt;br /&gt;
** Ray Morgan,Sr.Product Marketing Manager,Mask Data Prep(CATS)&lt;br /&gt;
** 2007&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf paper in pdf format ]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution.(2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have been addressed&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site](It&#039;s the same as the attachment of LCI paper above.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies(May 2006)&lt;br /&gt;
**[http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
==TeraGrid==&lt;br /&gt;
*&#039;&#039;&#039;Wide Area Filesystem Performance Using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
**TeraGrid 2007 Conference, Madison, WI&lt;br /&gt;
**Paper in PDF format&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4238</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4238"/>
		<updated>2008-01-28T05:14:39Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Glossary */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target ([http://www.clusterfs.com/faq-fundermentals.html#fund-3 what&#039;s the difference?] See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
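&lt;br /&gt;
To make the RAID-0 striping idea concrete, here is a minimal illustrative sketch in Python (not Lustre code; the stripe size and stripe count are example values, not defaults) that maps a file offset to the object holding it:&lt;br /&gt;
&lt;pre&gt;
def stripe_location(offset, stripe_size=1048576, stripe_count=4):
    # Map a byte offset in a RAID-0 striped file to
    # (object index, offset within that object).
    stripe_number = offset // stripe_size      # which stripe holds the byte
    obj_index = stripe_number % stripe_count   # which object (and OST) that is
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset

# Example: 1 MB stripes over 4 objects; byte 5 MB falls in stripe 5,
# which lives in object 1 at offset 1 MB inside that object.
print(stripe_location(5 * 1048576))   # prints (1, 1048576)
&lt;/pre&gt;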
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB each.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results.&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
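&lt;br /&gt;
Spelling out that allocated-space arithmetic (an illustrative Python one-liner; the figures are simply the ones quoted above):&lt;br /&gt;
&lt;pre&gt;
max_stripes = 160     # current per-file stripe limit quoted above
stripe_tb = 8         # approximate per-stripe limit, in TB
print(max_stripes * stripe_tb)   # 1280 TB, the roughly 1.28 PB per file quoted above
&lt;/pre&gt;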
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
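&lt;br /&gt;
As a small sketch of that default inode arithmetic (one inode per 4 kB of MDS device space; the 2 TB device size is the example from the answer above):&lt;br /&gt;
&lt;pre&gt;
# Default MDS inode arithmetic: roughly one inode per 4 kB of device space.
mds_device_bytes = 2 * 2**40      # a 2 TB MDS file system
bytes_per_inode = 4096            # the default ratio described above
default_inodes = mds_device_bytes // bytes_per_inode
print(default_inodes)             # 536870912, i.e. about 512 million inodes
hard_limit = 2**32                # a single MDS cannot exceed roughly 4 billion inodes
print(default_inodes &lt; hard_limit)
&lt;/pre&gt;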
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
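&lt;br /&gt;
The same sizing logic can be written down as a short, illustrative calculation (the per-OSS figures are the example numbers from the answer above, not recommendations):&lt;br /&gt;
&lt;pre&gt;
# Sizing sketch: how many OSSs are needed for a target aggregate bandwidth?
target_bandwidth_gbs = 10          # desired aggregate throughput, in GB/s
per_oss_bandwidth_gbs = 0.1        # a single-gigE OSS with ~100 MB/s of storage behind it
print(target_bandwidth_gbs / per_oss_bandwidth_gbs)   # 100 light OSS nodes ...

per_heavy_oss_gbs = 2.5            # a heavy OSS with three Elan 4 rails and 16 FC2 channels
print(target_bandwidth_gbs / per_heavy_oss_gbs)       # ... or 4 heavy OSS nodes
&lt;/pre&gt;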
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
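&lt;br /&gt;
To make the aggregation point concrete, here is a minimal, purely illustrative sketch of how many wire requests a burst of small sequential writes turns into (this is not Lustre code; the 1 MB chunk size and the in-flight window are the values quoted above):&lt;br /&gt;
&lt;pre&gt;
# Illustration only: small sequential writes are batched into 1 MB RPCs,
# much as the client page cache aggregates I/O before it goes on the wire.
RPC_SIZE = 1 * 2**20              # 1 MB wire chunks, as described above
MAX_IN_FLIGHT = 8                 # somewhere between the 5 and 10 quoted above

def rpcs_for(write_size, write_count):
    total = write_size * write_count
    full, tail = divmod(total, RPC_SIZE)
    return full + (1 if tail else 0)

rpcs = rpcs_for(16 * 2**10, 4096)        # 4096 application writes of 16 kB each
print(rpcs)                              # 64 RPCs rather than 4096 small ones
print(-(-rpcs // MAX_IN_FLIGHT))         # about 8 rounds with 8 requests in flight
&lt;/pre&gt;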
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same as that of client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage devices: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM). &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet) (2.1.22+), Cisco IB, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
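&lt;br /&gt;
A minimal conceptual sketch of that idea follows (it is hypothetical and greatly simplified -- it only shows how a name can be mapped deterministically onto one of several metadata servers, not how Lustre implements it):&lt;br /&gt;
&lt;pre&gt;
# Conceptual sketch: pick the metadata server responsible for a name by hashing.
# Real clustered metadata also has to handle recovery, rebalancing and coherency,
# as the rest of this answer explains.
import zlib

def mds_for(name, mds_count):
    # every client computes the same hash, so they all agree on the owner
    return zlib.crc32(name.encode()) % mds_count

print(mds_for(&#039;output.0001&#039;, 4))
print(mds_for(&#039;output.0002&#039;, 4))
&lt;/pre&gt;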
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
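&lt;br /&gt;
A conceptual sketch of that decision is shown below (this is not Lustre code; it only mirrors the two outcomes described above -- execute on the server, or grant a lock for client writeback caching):&lt;br /&gt;
&lt;pre&gt;
# Conceptual sketch of an intent-based metadata request. The client bundles the
# whole operation with its lock request; the server decides how to proceed.

def handle_intent(server_executes, operation):
    if server_executes:
        # high-concurrency case: perform the operation locally, return only a result
        return (&#039;executed&#039;, operation())
    # low-concurrency case: hand back a lock so the client can cache updates
    return (&#039;lock granted&#039;, None)

# Lustre 1.x always takes the first branch, since there is no client writeback
# cache yet -- but either way the metadata operation needs only a single RPC.
print(handle_intent(True, lambda: &#039;file created&#039;))
&lt;/pre&gt;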
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
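&lt;br /&gt;
A small conceptual sketch of that pre-creation idea (purely illustrative; the pool sizes and names are made up):&lt;br /&gt;
&lt;pre&gt;
# Conceptual sketch: the MDS hands out pre-created object numbers from a pool
# and asks the OST for more, asynchronously, when the pool runs low.
from collections import deque

class PrecreatedPool:
    def __init__(self, low_water=8, batch=32):
        self.pool = deque()
        self.low_water = low_water
        self.batch = batch
        self.next_objid = 1

    def _replenish(self):
        # in Lustre this is an asynchronous precreate request to the OST
        for _ in range(self.batch):
            self.pool.append(self.next_objid)
            self.next_objid += 1

    def allocate(self):
        if len(self.pool) &lt;= self.low_water:
            self._replenish()
        return self.pool.popleft()

pool = PrecreatedPool()
print([pool.allocate() for _ in range(3)])   # stripes get objects with no extra RPC
&lt;/pre&gt;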
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not made up of extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
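&lt;br /&gt;
To see why extents matter, compare the bookkeeping for a 1 GB object stored contiguously (illustrative arithmetic only):&lt;br /&gt;
&lt;pre&gt;
# Why extent metadata is smaller than a per-block list (illustration only).
object_size = 2**30               # a 1 GB object on an OST
block_size = 4096                 # the ext3 block size
print(object_size // block_size)  # 262144 individual block references ...

# ... versus a handful of extent records if the allocator returns contiguous
# space, which is exactly what the buddy block allocator is there to provide.
&lt;/pre&gt;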
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
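&lt;br /&gt;
A conceptual sketch of that callback sequence follows (illustrative only; it compresses the flush-then-drop rule and the eviction timeout into a few lines):&lt;br /&gt;
&lt;pre&gt;
# Conceptual sketch of the blocking-callback protocol on an OST lock server.
# Illustrative only -- real Lustre adds extents, recovery and much more.

def request_lock(holder, flush_cache, timed_out):
    if holder is None:
        return &#039;granted&#039;
    if timed_out(holder):
        # unresponsive client: evict it so the rest of the cluster can continue
        return &#039;granted after eviction&#039;
    # cooperative client: it writes back dirty data and drops cached pages first
    flush_cache(holder)
    return &#039;granted after callback&#039;

print(request_lock(&#039;client-7&#039;, flush_cache=lambda c: None, timed_out=lambda c: False))
&lt;/pre&gt;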
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
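&lt;br /&gt;
As a minimal sketch of using direct I/O from a client (Linux-specific Python; the path is hypothetical, and O_DIRECT requires suitably aligned buffers and transfer sizes):&lt;br /&gt;
&lt;pre&gt;
# Minimal direct I/O sketch: read 4 kB without going through the page cache.
# The path is made up; an anonymous mmap provides a page-aligned buffer.
import mmap, os

path = &#039;/mnt/lustre/output.dat&#039;          # hypothetical file on a Lustre mount
buf = mmap.mmap(-1, 4096)                 # page-aligned 4 kB buffer
fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
try:
    print(os.readv(fd, [buf]))            # still fully locked and cluster-coherent
finally:
    os.close(fd)
&lt;/pre&gt;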
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, servers running Linux 2.4 will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. A few extra kernel symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4237</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4237"/>
		<updated>2008-01-28T05:11:03Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but they are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs.  Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today. &lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be changed to match a larger PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clusters, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
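&lt;br /&gt;
To make the mkfs options mentioned above concrete: with the 1.6-series tools, options for the backing ext3 file system can be passed through at format time. The command below is a sketch only -- the file system name and device are placeholders, and &amp;quot;-i 4096&amp;quot; (one inode per 4 kB of device space) is simply an example of trading capacity for a larger inode count.&lt;br /&gt;
&lt;br /&gt;
 # format a combined MGS/MDT with a denser inode allocation (illustrative)&lt;br /&gt;
 mkfs.lustre --fsname=datafs --mgs --mdt --mkfsoptions=&amp;quot;-i 4096&amp;quot; /dev/sdb&lt;br /&gt;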
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
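&lt;br /&gt;
Both of these figures can be inspected and tuned per client. The proc path below is shown only as an example of where the 1.4/1.6 client exposes it; treat the exact location and a sensible value as release-dependent.&lt;br /&gt;
&lt;br /&gt;
 # show, then raise, the number of in-flight RPCs per OST connection&lt;br /&gt;
 cat /proc/fs/lustre/osc/*/max_rpcs_in_flight&lt;br /&gt;
 for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do echo 16 &gt; $f; done&lt;br /&gt;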
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
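&lt;br /&gt;
For reference, re-exporting an already-mounted Lustre client directory uses the ordinary Linux mechanisms; nothing Lustre-specific is involved. The paths and share name below are placeholders.&lt;br /&gt;
&lt;br /&gt;
 # /etc/exports on the re-exporting Lustre client (NFS)&lt;br /&gt;
 /mnt/lustre  *(rw,sync,no_subtree_check)&lt;br /&gt;
&lt;br /&gt;
 # smb.conf share on the same node (Samba/CIFS)&lt;br /&gt;
 [lustre]&lt;br /&gt;
     path = /mnt/lustre&lt;br /&gt;
     read only = no&lt;br /&gt;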
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: FibreChannel, SCSI, SATA, ATA, and exotic storage such as NVRAM. &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
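&lt;br /&gt;
As an illustration of how this is done with the standard ext3 tools: the journal device is formatted first, and the server file system then points at it. Device names and the file system name below are placeholders, and the journal device block size should match the file system block size.&lt;br /&gt;
&lt;br /&gt;
 # create an external ext3 journal, then an OST file system that uses it&lt;br /&gt;
 mke2fs -O journal_dev -b 4096 /dev/sdd1&lt;br /&gt;
 mkfs.lustre --fsname=datafs --ost --mgsnode=10.0.0.1@tcp0 --mkfsoptions=&amp;quot;-J device=/dev/sdd1&amp;quot; /dev/sdc&lt;br /&gt;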
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
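For step 4, the client mount really does look like an NFS mount. The sketch below uses the 1.6-series device syntax; the MGS address, file system name, and mount point are placeholders, and the 1.4 series uses a slightly different device string.&lt;br /&gt;
&lt;br /&gt;
 # mount a Lustre client (illustrative 1.6-style syntax)&lt;br /&gt;
 mount -t lustre 10.0.0.1@tcp0:/datafs /mnt/datafs&lt;br /&gt;
&lt;br /&gt;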
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
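&lt;br /&gt;
The description above covers the original Portals routing; in the Lustre 1.6 series the same idea is expressed as LNET module options. The lines below are a sketch only -- the network names, interfaces, and router NID are placeholders, and the exact syntax should be taken from the manual for your release.&lt;br /&gt;
&lt;br /&gt;
 # on the router node (both networks), e.g. in /etc/modprobe.conf&lt;br /&gt;
 options lnet networks=&amp;quot;tcp0(eth0),elan0&amp;quot; forwarding=enabled&lt;br /&gt;
 # on TCP-only clients, declaring a route to the elan network via that router&lt;br /&gt;
 options lnet networks=&amp;quot;tcp0(eth0)&amp;quot; routes=&amp;quot;elan0 10.0.0.5@tcp0&amp;quot;&lt;br /&gt;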
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest) &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
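&lt;br /&gt;
From the client side, the lfs utility reports which OSTs and object numbers back a given file, which is a convenient way to see this layout without mounting the OST file systems directly. The path below is a placeholder.&lt;br /&gt;
&lt;br /&gt;
 # show stripe count, stripe size, and the object id held by each OST&lt;br /&gt;
 lfs getstripe /mnt/datafs/output/restart.0001&lt;br /&gt;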
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking . If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
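&lt;br /&gt;
For a quick check from the shell, GNU dd can exercise direct I/O on a Lustre file; the path and sizes below are placeholders, and direct I/O requires suitably aligned, whole-block transfers.&lt;br /&gt;
&lt;br /&gt;
 # write and read back 1 GB while bypassing the client page cache&lt;br /&gt;
 dd if=/dev/zero of=/mnt/datafs/dio.tmp bs=1M count=1024 oflag=direct&lt;br /&gt;
 dd if=/mnt/datafs/dio.tmp of=/dev/null bs=1M iflag=direct&lt;br /&gt;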
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
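&lt;br /&gt;
Independently of MPI-IO, striping can already be chosen per file or per directory from the shell with the lfs utility. The options below follow the 1.6-style spelling and the path is a placeholder; the exact flags have varied between releases, so check lfs help on your installation.&lt;br /&gt;
&lt;br /&gt;
 # create an empty file striped 1 MB wide across 4 OSTs, then confirm it&lt;br /&gt;
 lfs setstripe -s 1M -c 4 /mnt/datafs/output/big.dat&lt;br /&gt;
 lfs getstripe /mnt/datafs/output/big.dat&lt;br /&gt;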
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tools will reliably fix any damage they can. Repair can run in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4236</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4236"/>
		<updated>2008-01-28T05:09:47Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Do you plan to support OSS failover without shared storage? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (for the difference, see &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre updates the atime of files lazily -- if an inode needs to be changed on disk anyway, an atime update is piggy-backed onto it if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS.  In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs.  Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clients, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
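&lt;br /&gt;
A quick back-of-the-envelope check of the per-file figure quoted above, using the 160-stripe and 8 TB-per-stripe limits from this answer:&lt;br /&gt;
 max_stripes = 160                  # current per-file stripe limit
 tb_per_stripe = 8                  # ext3 / Linux 2.6 per-OST object limit
 total_tb = max_stripes * tb_per_stripe
 print(total_tb, "TB, i.e. about", total_tb / 1000, "PB")   # 1280 TB, about 1.28 PB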
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of RAM, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
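&lt;br /&gt;
A rough check of those MDS figures, assuming the default of roughly one inode per 4 kB of device space (the ratio that the mkfs options mentioned above can change):&lt;br /&gt;
 mds_device_bytes = 2 * 2**40           # a 2 TB MDS file system
 bytes_per_inode = 4096                 # assumed default ratio
 print(mds_device_bytes // bytes_per_inode)   # 536870912, the "about 512 million" quoted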
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
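&lt;br /&gt;
That sizing rule can be reduced to a small helper: provision enough OSSs to satisfy both the capacity target and the bandwidth target, whichever requires more. The illustrative Python below (not a supported tool) simply replays the two example configurations from this answer:&lt;br /&gt;
 import math
 
 def osss_needed(total_tb, total_gb_per_s, tb_per_oss, gb_per_s_per_oss):
     by_capacity = math.ceil(total_tb / tb_per_oss)
     by_bandwidth = math.ceil(total_gb_per_s / gb_per_s_per_oss)
     return max(by_capacity, by_bandwidth)
 
 print(osss_needed(100, 10, tb_per_oss=1, gb_per_s_per_oss=0.1))    # 100 gig-e OSSs
 print(osss_needed(100, 10, tb_per_oss=25, gb_per_s_per_oss=2.5))   # 4 heavy-duty OSSs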
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
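&lt;br /&gt;
As a deliberately simplified model of this pipeline (a toy formula, not something Lustre computes; real throughput is also bounded by disk, CPU, and network limits), per-server bandwidth is roughly capped by the data kept in flight divided by the round-trip time:&lt;br /&gt;
 def pipeline_bound(rpc_size_mb=1.0, rpcs_in_flight=8, rtt_ms=5.0):
     return rpcs_in_flight * rpc_size_mb / (rtt_ms / 1000.0)    # MB/s
 
 print(pipeline_bound())                 # 1600.0 MB/s ceiling in this toy model
 print(pipeline_bound(rtt_ms=20.0))      # higher latency lowers the ceiling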
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller or the same size as client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
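&lt;br /&gt;
A conceptual sketch of the name-hashing step (plain Python, not the Lustre implementation; the hash function and server count are arbitrary):&lt;br /&gt;
 import hashlib
 
 def mds_for_name(name, mds_count):
     digest = hashlib.md5(name.encode()).digest()      # any stable hash would do
     return int.from_bytes(digest[:4], "little") % mds_count
 
 for n in ("out.0001", "out.0002", "restart.dat"):
     print(n, "is handled by MDS", mds_for_name(n, mds_count=8))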
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
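&lt;br /&gt;
A toy sketch of that choice (plain Python; the names and the contention test are invented, and Lustre 1.x always takes the &amp;quot;execute on the server&amp;quot; branch):&lt;br /&gt;
 def lock_with_intent(contended, intent, execute):
     """Server-side choice: run the bundled operation now, or hand back a lock."""
     if contended:
         return {"mode": "executed", "result": execute(intent)}   # one RPC in total
     return {"mode": "lock granted"}       # client could then cache updates locally
 
 create = {"op": "create", "name": "out.0001"}
 print(lock_with_intent(True, create, execute=lambda i: "created " + i["name"]))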
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
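&lt;br /&gt;
A sketch of the idea (illustrative Python only -- the object IDs, names, and batch size are made up, and real replenishment happens asynchronously, well before the pool runs dry):&lt;br /&gt;
 from collections import deque
 
 class PrecreatePool:
     """Toy per-OST pool of precreated object IDs."""
     def __init__(self, ost, batch=32):
         self.ost, self.batch, self.next_id = ost, batch, 1
         self.pool = deque()
         self.replenish()                  # initial precreation request
     def replenish(self):
         # Stand-in for the RPC asking the OST to precreate another batch.
         self.pool.extend(range(self.next_id, self.next_id + self.batch))
         self.next_id += self.batch
     def allocate_stripe(self):
         if not self.pool:                 # real code replenishes before this point
             self.replenish()
         return (self.ost, self.pool.popleft())
 
 pool = PrecreatePool("OST0000")
 print([pool.allocate_stripe() for _ in range(3)])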
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
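&lt;br /&gt;
The conflict handling described above can be walked through with a small toy model (all class and method names here are invented for illustration; they are not Lustre code):&lt;br /&gt;
 class Client:
     def __init__(self, name, responsive=True):
         self.name, self.responsive = name, responsive
         self.dirty_pages, self.holds_lock = ["cached write"], True
     def blocking_callback(self):
         # The lock server asks this holder to give the lock back.
         if not self.responsive:
             raise TimeoutError(self.name)   # no reply within the configurable timeout
         self.dirty_pages.clear()            # write back cached modifications first
         self.holds_lock = False             # only then is the lock dropped
 
 def grant_conflicting_lock(holder):
     try:
         holder.blocking_callback()
         return "granted"
     except TimeoutError:
         holder.holds_lock = False           # eviction: holder must reconnect later
         return "granted after evicting " + holder.name
 
 print(grant_conflicting_lock(Client("well-behaved client")))
 print(grant_conflicting_lock(Client("hung client", responsive=False)))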
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported; beginning with Lustre 1.8, clients running Linux 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is a customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4235</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4235"/>
		<updated>2008-01-28T05:08:23Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
It is common for a single OSS to export more than one OST, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB each.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today. &lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of RAM, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
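&lt;br /&gt;
As an illustration only -- using the Lustre 1.6-style formatting tool, which passes extra options through to the backend mke2fs -- the inode count can be raised by lowering the bytes-per-inode ratio (device and file system names are placeholders):&lt;br /&gt;
&lt;pre&gt;
# format the MDS device with 2 kB per inode instead of the ~4 kB default,
# roughly doubling the number of available inodes
mkfs.lustre --fsname=testfs --mgs --mdt --mkfsoptions=&#039;-i 2048&#039; /dev/sda1
&lt;/pre&gt;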
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
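&lt;br /&gt;
These defaults are tunable on the client; for example (the proc paths and target names below are illustrative and version-dependent):&lt;br /&gt;
&lt;pre&gt;
# show the per-server limit on concurrent RPCs (default 8 in many releases)
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight

# raise the limit for one object server connection (target name is a placeholder)
echo 16 &gt; /proc/fs/lustre/osc/testfs-OST0000-osc/max_rpcs_in_flight
&lt;/pre&gt;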
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
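&lt;br /&gt;
As a sketch, a single-node CIFS export of an already-mounted Lustre client needs only an ordinary Samba share definition (paths and share names are placeholders):&lt;br /&gt;
&lt;pre&gt;
# append a minimal share for the Lustre mount point to smb.conf
cat &gt;&gt; /etc/samba/smb.conf &lt;&lt;EOF
[lustre]
   path = /mnt/lustre
   read only = no
EOF
# then restart or reload the Samba daemons so the new share is picked up
&lt;/pre&gt;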
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems of different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM). &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
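&lt;br /&gt;
The mechanism is the standard ext3 external journal; a rough sketch follows (device names are placeholders, and with Lustre&#039;s own formatting tools the same options are normally passed through as mkfsoptions):&lt;br /&gt;
&lt;pre&gt;
# create a dedicated journal device
mke2fs -O journal_dev /dev/sdc1

# create the backend ext3 file system with its journal on that device
mke2fs -j -J device=/dev/sdc1 /dev/sdb1
&lt;/pre&gt;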
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
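&lt;br /&gt;
For example, either of the following block devices could then be formatted as a backend server file system (device and volume names are placeholders):&lt;br /&gt;
&lt;pre&gt;
# a software RAID-5 set built from four disks
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# or an LVM logical volume
pvcreate /dev/sdb /dev/sdc
vgcreate lustre_vg /dev/sdb /dev/sdc
lvcreate -L 500G -n ost0 lustre_vg
&lt;/pre&gt;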
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
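&lt;br /&gt;
As a rough sketch of steps 3 and 4 using the Lustre 1.6-style utilities (earlier 1.4.x releases drive these steps through the generated configuration file instead; node, device, and file system names are placeholders):&lt;br /&gt;
&lt;pre&gt;
# format and start the metadata server (with a co-located management service)
mkfs.lustre --fsname=testfs --mgs --mdt /dev/sda1
mount -t lustre /dev/sda1 /mnt/mdt

# format and start an object server volume, pointing it at the MGS node
mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 /dev/sdb1
mount -t lustre /dev/sdb1 /mnt/ost0

# mount the file system on a client
mount -t lustre mds1@tcp0:/testfs /mnt/testfs
&lt;/pre&gt;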
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Cisco, GM (Myrinet) (2.1.22+), and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small I/O requests, etc.), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
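&lt;br /&gt;
For example, with a recent GNU dd the client page cache can be bypassed entirely (file names are placeholders):&lt;br /&gt;
&lt;pre&gt;
# write 1 GB using O_DIRECT, bypassing the client page cache
dd if=/dev/zero of=/mnt/lustre/dio-test bs=1M count=1024 oflag=direct

# read it back with O_DIRECT as well
dd if=/mnt/lustre/dio-test of=/dev/null bs=1M iflag=direct
&lt;/pre&gt;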
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4234</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4234"/>
		<updated>2008-01-28T05:07:30Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Does Lustre use/provide a single security domain? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (see &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
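&lt;br /&gt;
As a rough illustration of this RAID-0 layout, the following Python sketch maps a byte offset in a striped file to the object holding it and the offset within that object; the 1 MB stripe size and the stripe count of 4 are illustrative values, not defaults taken from any particular Lustre release.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
def stripe_location(offset, stripe_size=1048576, stripe_count=4):
    # Map a byte offset in a RAID-0 striped file to (stripe index,
    # offset within that stripe object).
    chunk = offset // stripe_size            # which stripe-sized chunk
    stripe_index = chunk % stripe_count      # which object (and hence OST)
    chunk_in_object = chunk // stripe_count
    object_offset = chunk_in_object * stripe_size + offset % stripe_size
    return stripe_index, object_offset

# Byte 5,000,000 of a 4-way striped file with 1 MB stripes lands in
# object 0 at offset 1,854,272:
print(stripe_location(5000000))
&lt;/pre&gt;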
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and consistent usage has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
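&lt;br /&gt;
To make the cross-node atomicity guarantee concrete, here is a small Python sketch that could be run from several Lustre clients at once; the mount point /mnt/lustre and the iteration counts are assumptions for illustration, and the check simply verifies that a reader never observes a mixture of the two writers&#039; byte patterns.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
import os, sys

# Assumed Lustre client mount point (hypothetical path).  Run one copy as
# writer 0 on one node, one as writer 1 on another, and one as the reader
# on a third; because overlapping reads and writes are serialized by the
# distributed lock manager, the reader should only ever see a region filled
# entirely with one writer pattern.
PATH = &#039;/mnt/lustre/atomic.dat&#039;
REGION = 1024 * 1024
PATTERNS = [bytes([65]) * REGION, bytes([66]) * REGION]  # all A bytes / all B bytes

def writer(which):
    fd = os.open(PATH, os.O_CREAT | os.O_WRONLY)
    for _ in range(1000):
        os.pwrite(fd, PATTERNS[which], 0)   # both writers hit the same range
    os.close(fd)

def reader():
    fd = os.open(PATH, os.O_RDONLY)
    for _ in range(1000):
        data = os.pread(fd, REGION, 0)
        # A buffer mixing both patterns would indicate a coherency problem.
        assert data in PATTERNS or len(data) == 0
    os.close(fd)

if __name__ == &#039;__main__&#039;:
    if sys.argv[1] == &#039;reader&#039;:
        reader()
    else:
        writer(int(sys.argv[1]))
&lt;/pre&gt;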
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs.  Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clusters, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
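&lt;br /&gt;
Restating the arithmetic behind these limits in a short Python sketch (nothing below is measured; it only repeats the figures quoted in this section):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
TB = 1  # work in units of terabytes

max_file = 160 * 8 * TB    # 160 stripes of at most 8 TB each
print(max_file)            # 1280 TB, i.e. roughly 1.28 PB per file

max_fs = 4000 * 8 * TB     # roughly 4,000 OSTs of at most 8 TB each
print(max_fs)              # 32000 TB, i.e. about 32 PB per file system
&lt;/pre&gt;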
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
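&lt;br /&gt;
A rough sizing helper in the spirit of that example is sketched below in Python; the per-node bandwidth and capacity figures are taken from the example above, not recommendations for any particular hardware.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
import math

def oss_count(target_gb_s, target_tb, node_gb_s, node_tb):
    # The number of OSS nodes is set by whichever requirement is harder to
    # meet: aggregate bandwidth or total capacity.
    by_bandwidth = math.ceil(target_gb_s / node_gb_s)
    by_capacity = math.ceil(target_tb / node_tb)
    return max(by_bandwidth, by_capacity)

print(oss_count(10, 100, 0.1, 1))    # gig-e nodes, 100 MB/s and 1 TB each: 100
print(oss_count(10, 100, 2.5, 25))   # heavy-duty nodes, 2.5 GB/s and 25 TB: 4
&lt;/pre&gt;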
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size of client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet) (2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
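&lt;br /&gt;
A toy picture of that routing decision is sketched below in Python: a node on one network reaches a server on another by forwarding through a gateway that sits on both. The network names, node names, and the routing table are invented for the sketch; this is not the Portals routing code.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
ROUTES = {
    (&#039;elan&#039;, &#039;tcp&#039;): &#039;gw01&#039;,   # gateway with both Elan and gig-e interfaces
    (&#039;tcp&#039;, &#039;elan&#039;): &#039;gw01&#039;,
}

def next_hop(src_net, dst_net, dst_node):
    if src_net == dst_net:
        return dst_node                 # same network: send directly
    return ROUTES[(src_net, dst_net)]   # otherwise forward via the gateway

print(next_hop(&#039;elan&#039;, &#039;tcp&#039;, &#039;oss17&#039;))
&lt;/pre&gt;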
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
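&lt;br /&gt;
A toy illustration of that hashed-namespace idea is sketched below in Python. The hash function and the server count of 4 are arbitrary choices for the sketch; this is not the actual clustered-metadata protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
import zlib

def mds_for_name(name, mds_count=4):
    # Hash the name and pick one of the metadata servers over which the
    # directory is striped.
    return zlib.crc32(name.encode()) % mds_count

for n in [&#039;alpha.out&#039;, &#039;beta.out&#039;, &#039;gamma.out&#039;]:
    print(n, mds_for_name(n))
&lt;/pre&gt;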
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
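&lt;br /&gt;
Rough arithmetic below shows why very widely striped files cannot keep their striping EA inside a 256-byte inode. The sizes used (a 32-byte EA header, 24 bytes per stripe entry, and roughly 100 bytes of usable in-inode EA space) are assumptions for illustration, not figures taken from the Lustre source.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
EA_HEADER = 32          # assumed striping EA header size, bytes
PER_STRIPE = 24         # assumed size of each per-stripe entry, bytes
IN_INODE_EA_SPACE = 100 # assumed EA space left inside a 256-byte inode

widest_in_inode = (IN_INODE_EA_SPACE - EA_HEADER) // PER_STRIPE
print(widest_in_inode)  # beyond this stripe count the EA spills to a
                        # separate block, costing an extra seek on lookup
&lt;/pre&gt;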
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
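&lt;br /&gt;
A schematic Python sketch of that server-side choice follows (this is not Lustre code; the single-client test and the data structures are invented for illustration): the client ships the whole operation with its lock request, and the server either executes it at once or hands back a lock so the client could cache the update.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
class Dir:
    def __init__(self, active_clients):
        self.active_clients = active_clients
        self.entries = {}

def create(d, name):
    d.entries[name] = object()

def handle_intent(d, op, name):
    if d.active_clients == 1:
        # Low contention: a writeback-caching client could be granted a lock
        # and apply the update locally (not implemented in Lustre 1.x).
        return &#039;LOCK_GRANTED&#039;
    # High contention (the 1,000 clients creating files in one directory):
    # execute on the server and return only the result, one RPC in total.
    op(d, name)
    return &#039;EXECUTED&#039;

print(handle_intent(Dir(active_clients=1000), create, &#039;output.0042&#039;))
&lt;/pre&gt;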
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
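&lt;br /&gt;
A toy model of that pre-creation pool is sketched below in Python; the pool size and refill batch are invented values, and the asynchronous precreate RPC is replaced by a simple synchronous refill.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
import collections

REFILL_BATCH = 32
LOW_WATER = 8

class OstPrecreatePool:
    def __init__(self):
        self.next_id = 1
        self.pool = collections.deque()
        self.refill()

    def refill(self):
        # In Lustre this is an asynchronous precreate request to the OST;
        # here we simply hand out consecutive object ids.
        for _ in range(REFILL_BATCH):
            self.pool.append(self.next_id)
            self.next_id += 1

    def allocate(self):
        # Assign a pre-created object as a file stripe without an extra RPC.
        obj = self.pool.popleft()
        if len(self.pool) == LOW_WATER:
            self.refill()
        return obj

pool = OstPrecreatePool()
print([pool.allocate() for _ in range(5)])
&lt;/pre&gt;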
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
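&lt;br /&gt;
A schematic Python model of that conflict-callback flow follows (this is not Lustre code; the classes and the eviction decision are simplified for illustration): the holder must write back and purge its cached data before releasing, and a holder that never answers is evicted.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
class Client:
    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive
        self.dirty = {}            # resource: cached modifications
        self.locks = set()

    def blocking_callback(self, resource):
        if not self.responsive:
            return False                 # timed out; the server will evict us
        self.dirty.pop(resource, None)   # stands in for the real writeback
        self.locks.discard(resource)     # nothing may stay cached unlocked
        return True

def grant(server_locks, resource, requester):
    holder = server_locks.get(resource)
    if holder is not None and not holder.blocking_callback(resource):
        print(holder.name, &#039;evicted; it must reconnect before further I/O&#039;)
    server_locks[resource] = requester
    requester.locks.add(resource)

locks = {}
grant(locks, &#039;object-934151&#039;, Client(&#039;client-a&#039;))
grant(locks, &#039;object-934151&#039;, Client(&#039;client-b&#039;))
&lt;/pre&gt;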
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
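&lt;br /&gt;
A minimal direct I/O sketch in Python is shown below; the mount point /mnt/lustre is a hypothetical path, and os.O_DIRECT is Linux-specific. O_DIRECT needs an aligned buffer, offset, and length, so an anonymous mmap is used to obtain page-aligned memory.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
import mmap, os

ALIGN = 4096
buf = mmap.mmap(-1, ALIGN)         # anonymous mapping: page-aligned memory
buf.write(bytes([120]) * ALIGN)    # fill with a recognisable pattern

fd = os.open(&#039;/mnt/lustre/direct.dat&#039;,
             os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
os.write(fd, buf)                  # bypasses the client page cache; Lustre
                                   # still takes extent locks for this I/O
os.close(fd)
&lt;/pre&gt;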
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage it can. It will run in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is a customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4233</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4233"/>
		<updated>2008-01-28T05:06:22Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* What is the licensing model for the Lustre file system for Linux? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
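&lt;br /&gt;
For illustration, striping can be inspected and set per file with the lfs utility. This is a minimal sketch only; the exact option names vary between Lustre versions, so check the lfs help on your release:&lt;br /&gt;
&lt;pre&gt;
# Sketch: create a file striped across 4 OSTs with a 1 MB stripe size
# (the -c/-s flag names are an assumption; older releases use positional arguments)
lfs setstripe -c 4 -s 1M /mnt/lustre/output.dat

# Show which OST objects hold the file's data
lfs getstripe /mnt/lustre/output.dat
&lt;/pre&gt;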
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce the distinction consistently.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Some installations use Lustre as the root file system on both clients and servers, although this will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
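&lt;br /&gt;
As a rough sketch of adding an OST online, assuming the Lustre 1.6-style mkfs.lustre/mount interface (the device name, file system name, and MGS NID below are examples only):&lt;br /&gt;
&lt;pre&gt;
# On the OSS: format a new backend device as an additional OST and start it
mkfs.lustre --fsname=datafs --ost --mgsnode=mgs@tcp0 /dev/sdc
mount -t lustre /dev/sdc /mnt/lustre/ost2
&lt;/pre&gt;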
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clusters, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
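&lt;br /&gt;
For example, the inode count can be raised when the MDS backend device is formatted by lowering the bytes-per-inode ratio. A minimal sketch using the standard mke2fs option (the device name is an example; the exact Lustre formatting procedure depends on your version):&lt;br /&gt;
&lt;pre&gt;
# Sketch: allocate one inode per 4 kB of device space on the MDS backend
mke2fs -j -i 4096 /dev/sdb1
&lt;/pre&gt;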
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
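&lt;br /&gt;
The number of requests kept in flight can be tuned per client. A minimal sketch, assuming the max_rpcs_in_flight /proc entries are present in your Lustre version (the path and default are assumptions; consult your release documentation):&lt;br /&gt;
&lt;pre&gt;
# Assumption: this /proc path exists on your release
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
    echo 8 &amp;gt; "$f"    # allow up to 8 concurrent RPCs to each OST
done
&lt;/pre&gt;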
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
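&lt;br /&gt;
As a minimal sketch of both export paths on a Lustre client acting as a gateway (share names, paths, and options are examples; tuning is omitted):&lt;br /&gt;
&lt;pre&gt;
# /etc/exports entry for NFS re-export of the Lustre mount
/mnt/lustre   *(rw,sync,no_subtree_check)

# smb.conf share for CIFS access to the same mount
[lustre]
    path = /mnt/lustre
    read only = no
&lt;/pre&gt;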
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA and exotic storage (NVRAM) are supported. &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
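&lt;br /&gt;
A minimal sketch using the standard ext3 external-journal options (device names are examples):&lt;br /&gt;
&lt;pre&gt;
# Create a dedicated journal device, then format the data device to use it
mke2fs -O journal_dev /dev/sdc1
mke2fs -j -J device=/dev/sdc1 /dev/sdb1
&lt;/pre&gt;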
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
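&lt;br /&gt;
For example, a software RAID-5 device built with the standard Linux tools can serve as the backend block device (a sketch; device names are examples):&lt;br /&gt;
&lt;pre&gt;
# Assemble four disks into a RAID-5 array, then use /dev/md0 as OST storage
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
&lt;/pre&gt;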
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
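&lt;br /&gt;
As a concrete illustration of steps 3 and 4 above, here is a sketch assuming the Lustre 1.6-style mkfs.lustre/mount interface (host names, NIDs, device paths, and the file system name are examples only):&lt;br /&gt;
&lt;pre&gt;
# Format and start the object servers in parallel from a management node
pdsh -w oss[1-4] "mkfs.lustre --fsname=datafs --ost --mgsnode=mds1@tcp0 /dev/sdb"
pdsh -w oss[1-4] "mount -t lustre /dev/sdb /mnt/lustre/ost"

# On each client, mount the file system much like NFS
mount -t lustre mds1@tcp0:/datafs /mnt/datafs
&lt;/pre&gt;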
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s, 1 GB/s on Woodcrest &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
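&lt;br /&gt;
For example, an OST&#039;s backend device can be mounted directly as ext3 (read-only, with the OST service stopped) to look at these objects. A sketch, with example device and mount-point names and an illustrative listing:&lt;br /&gt;
&lt;pre&gt;
# Inspect a stopped OST's backend file system
mount -t ext3 -o ro /dev/sdb1 /mnt/ost-inspect
ls -R /mnt/ost-inspect | head    # object files appear with numeric names
umount /mnt/ost-inspect
&lt;/pre&gt;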
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking . If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
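&lt;br /&gt;
For example, direct I/O can be exercised from the command line with GNU dd (a sketch; the target path is an example):&lt;br /&gt;
&lt;pre&gt;
# Write 128 MB to a Lustre file, bypassing the client page cache
dd if=/dev/zero of=/mnt/lustre/dio.dat bs=1M count=128 oflag=direct
&lt;/pre&gt;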
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
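&lt;br /&gt;
A minimal sketch of a Heartbeat v1 haresources entry for an OST mount (node, device, and mount-point names are examples; the power-control setup is omitted):&lt;br /&gt;
&lt;pre&gt;
# /etc/ha.d/haresources: oss1 normally owns this OST; its partner takes over on failure
oss1 Filesystem::/dev/sdb1::/mnt/lustre/ost1::lustre
&lt;/pre&gt;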
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tool will reliably fix any damage it can. It runs in parallel on all nodes, but can still be very time-consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and with 1.8, clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
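As a rough illustration of a client-side upgrade between wire-compatible releases (a minimal sketch; the package names and mount target are placeholders, the mount syntax varies between Lustre versions, and servers follow the same stop, upgrade, restart pattern):&lt;br /&gt;
&lt;pre&gt;
umount /mnt/lustre                 # stop using the file system on this client
rpm -Uvh lustre-*.rpm              # install the updated Lustre RPMs
                                   # (reload the Lustre kernel modules if they changed)
mount -t lustre mds1@tcp0:/testfs /mnt/lustre
&lt;/pre&gt;
&lt;br /&gt;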
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4232</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4232"/>
		<updated>2008-01-28T05:05:26Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Which Lustre support services are available? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
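As a sketch of how striping is controlled today with the lfs utility (the option-style syntax shown here is the newer form; older releases take positional arguments, and the file name is a placeholder):&lt;br /&gt;
&lt;pre&gt;
# stripe a new file across 4 OSTs with a 1 MB stripe size, then inspect the layout
lfs setstripe -c 4 -s 1M /mnt/lustre/output.dat
lfs getstripe /mnt/lustre/output.dat
&lt;/pre&gt;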
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Lustre will aggregate multiple OSTs into a single large file system, but each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but they are not yet supported today. Support is coming soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
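A minimal sketch of adding an OST online using the Lustre 1.6-style commands (the device, file system name, and MGS node below are placeholders; earlier releases use the XML-based configuration tools instead):&lt;br /&gt;
&lt;pre&gt;
# on the new (or existing) OSS
mkfs.lustre --fsname=testfs --ost --mgsnode=mgs1@tcp0 /dev/sdb
mount -t lustre /dev/sdb /mnt/ost1     # the new OST then becomes available to the file system
&lt;/pre&gt;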
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs.  Running with almost 4,000 OSTs has been tried; at 8 TB per OST, this means 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production file systems of 1.4 PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clients, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
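For illustration, at the ext3 level the inode count is governed by the bytes-per-inode ratio; a sketch, assuming a hypothetical MDS device (Lustre&#039;s own formatting tools normally pass such options for you):&lt;br /&gt;
&lt;pre&gt;
# one inode per 2048 bytes of device space, roughly doubling the inode count quoted above
mke2fs -j -i 2048 /dev/sdb1
&lt;/pre&gt;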
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-GigE-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
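A back-of-the-envelope sketch of the sizing reasoning above (illustrative numbers only):&lt;br /&gt;
&lt;pre&gt;
target_mb_per_sec=10000      # 10 GB/s of required aggregate bandwidth
per_oss_mb_per_sec=100       # one GigE-connected OSS with 100 MB/s of storage behind it
echo $(( target_mb_per_sec / per_oss_mb_per_sec ))   # -&gt; 100 OSS nodes
&lt;/pre&gt;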
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
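On a client, the per-server RPC concurrency can be inspected and tuned through /proc; a sketch, assuming the usual tunable names (the exact paths vary between Lustre releases):&lt;br /&gt;
&lt;pre&gt;
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight        # how many RPCs may be in flight per OST
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
    echo 16 &gt; $f                                    # allow more concurrency, e.g. on high-latency links
done
&lt;/pre&gt;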
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
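A back-of-the-envelope sizing sketch at roughly 4 kB of MDS storage per file (illustrative file count):&lt;br /&gt;
&lt;pre&gt;
files=20000000                                   # 20 million files
echo $(( files * 4 / 1024 / 1024 )) GB           # -&gt; roughly 76 GB of MDS storage
&lt;/pre&gt;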
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support any block storage: FibreChannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
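For illustration, at the ext3 level an external journal looks like the following sketch (device names are placeholders; Lustre&#039;s own formatting and configuration tools normally pass these options for you):&lt;br /&gt;
&lt;pre&gt;
mke2fs -O journal_dev /dev/sdc1            # create the dedicated journal device
mke2fs -j -J device=/dev/sdc1 /dev/sdb1    # create the backend file system that uses it
&lt;/pre&gt;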
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
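A minimal sketch of steps 3 and 4 using the newer Lustre 1.6-style commands (all names are placeholders; the configuration-file flow described in step 2 applies to earlier releases):&lt;br /&gt;
&lt;pre&gt;
mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1                  # on the MDS
mount -t lustre /dev/sda1 /mnt/mdt
mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 /dev/sdb1    # on each OSS
mount -t lustre /dev/sdb1 /mnt/ost0
mount -t lustre mds1@tcp0:/testfs /mnt/lustre                      # on each client
&lt;/pre&gt;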
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Cisco, Myrinet GM (2.1.22+), and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, not extremely small requests, and so on), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
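For example, direct I/O can be exercised from a client with an ordinary tool such as GNU dd (a sketch; the path is a placeholder):&lt;br /&gt;
&lt;pre&gt;
dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=1024 oflag=direct   # bypasses the client page cache
&lt;/pre&gt;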
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/docs/Lustre-Roadmap.pdf roadmap] features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4231</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4231"/>
		<updated>2008-01-28T05:01:30Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* What is the licensing model for the Lustre file system for Linux? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (for the difference between an OSS and an OST, see the Fundamentals section below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
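&lt;br /&gt;
As a rough illustration of RAID-0 striping, the following Python sketch maps a logical file offset to an object index and an object-relative offset. The stripe size and stripe count shown are assumptions chosen for the example, not Lustre defaults.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Minimal sketch of RAID-0 striping arithmetic, assuming a fixed stripe size
# and stripe count chosen when the file was created (hypothetical values).
def stripe_of(offset, stripe_size=1048576, stripe_count=4):
    stripe_number = offset // stripe_size         # which stripe, counting from 0
    object_index = stripe_number % stripe_count   # objects are used round-robin
    # full stripes already placed on this object, plus the remainder
    offset_in_object = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, offset_in_object

# Example: byte 5 MB of a file striped 1 MB wide over 4 objects
print(stripe_of(5 * 1048576))   # (1, 1048576)
&lt;/pre&gt;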
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB each.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations are enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much larger still.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64 bit clusters, the maximum file size is 2^64.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8TB per stripe, leading to about 1.28PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of RAM, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
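&lt;br /&gt;
To make the sizing arithmetic concrete, here is a small Python sketch using the hypothetical per-server figures from the example above (illustrative numbers, not measured limits):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Rough OSS sizing sketch: enough servers are needed to meet both the
# capacity target and the aggregate bandwidth target.
import math

def oss_count(capacity_tb, bandwidth_gbs, per_oss_tb, per_oss_mbs):
    by_capacity = math.ceil(capacity_tb / per_oss_tb)
    by_bandwidth = math.ceil(bandwidth_gbs * 1000 / per_oss_mbs)
    return max(by_capacity, by_bandwidth)

# 100 TB at 10 GB/s with 1 TB, 100 MB/s gig-e OSS nodes: 100 servers
print(oss_count(100, 10, 1, 100))
# The same target with 25 TB, 2.5 GB/s heavy-duty servers: 4 servers
print(oss_count(100, 10, 25, 2500))
&lt;/pre&gt;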
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
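&lt;br /&gt;
As a back-of-envelope illustration (an assumption made for this example, not a Lustre formula), keeping several RPCs in flight is what hides the network round-trip time:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Pipelining estimate: with N RPCs of size S outstanding to one server, a
# client can keep roughly N * S bytes moving per round trip (hypothetical
# figures below, chosen only to show the shape of the calculation).
def per_server_throughput_mbs(rpcs_in_flight, rpc_size_mb, round_trip_s):
    return rpcs_in_flight * rpc_size_mb / round_trip_s

# 8 one-megabyte RPCs in flight over a 10 ms round trip: an 800 MB/s ceiling
print(per_server_throughput_mbs(8, 1.0, 0.010))
&lt;/pre&gt;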
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, so RAID-1 mirrored storage is recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
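&lt;br /&gt;
For example, applying the 4 kB-per-file rule of thumb to the roughly 20 million files mentioned above (a worked example, not a requirement):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# MDS storage sizing from the 4 kB-per-file rule of thumb.
files = 20 * 1000 * 1000     # file count from the example configuration above
bytes_per_file = 4096
print(files * bytes_per_file / 2**30)   # roughly 76 GiB of MDS storage
&lt;/pre&gt;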
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems of different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet) (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
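&lt;br /&gt;
A minimal sketch of that placement idea follows (Python; the hash function and server count are assumptions for illustration, not the actual Lustre algorithm):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Sketch of entry placement in a striped directory: hash the name and pick
# one of the metadata servers serving that directory.
import zlib

def mds_for_name(name, mds_count):
    # Any stable hash will do for the illustration; Lustre defines its own.
    return zlib.crc32(name.encode()) % mds_count

# With a directory striped over 4 metadata servers:
for name in (&quot;alpha&quot;, &quot;beta&quot;, &quot;gamma&quot;):
    print(name, mds_for_name(name, 4))
&lt;/pre&gt;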
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
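&lt;br /&gt;
A very rough Python sketch of the idea (the field names and decision rule are invented for illustration; this is not Lustre&#039;s wire format):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Sketch of an intent-based lock request: the client bundles the whole
# metadata operation with its lock request, so the server can choose to
# execute it immediately and answer in a single RPC.
from dataclasses import dataclass

@dataclass
class IntentLockRequest:
    parent: str       # directory the lock is requested on
    operation: str    # e.g. create or lookup
    name: str         # entry name the operation applies to

def serve(request, namespace, contended):
    if contended:
        # High concurrency: execute on the server, return only the result.
        namespace.setdefault(request.parent, set()).add(request.name)
        return &quot;executed on server&quot;
    # Low concurrency: grant a lock so the client could cache updates
    # (the writeback-cache path is not present in Lustre 1.x).
    return &quot;lock granted for client caching&quot;

namespace = {}
print(serve(IntentLockRequest(&quot;/tmp&quot;, &quot;create&quot;, &quot;out.0&quot;), namespace, True))
&lt;/pre&gt;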
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor made up of extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
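&lt;br /&gt;
The sketch below (Python, with invented names and a greatly simplified flow) shows the shape of that exchange, including eviction when the holder never answers the callback:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Simplified shape of the lock-revocation protocol described above: on a
# conflicting request the holder is asked to flush and drop its lock, and is
# evicted if it does not answer within the timeout.
class Client:
    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive
        self.dirty_pages = [&quot;cached write&quot;]

    def blocking_callback(self):
        if not self.responsive:
            return False            # never answers: powered off, crashed, ...
        self.dirty_pages.clear()    # write back cached modifications first
        return True                 # ...then drop the lock

def request_conflicting_lock(holder):
    if holder.blocking_callback():
        return &quot;lock granted to the new client&quot;
    return holder.name + &quot; evicted after timeout; lock granted to the new client&quot;

print(request_conflicting_lock(Client(&quot;client-a&quot;)))
print(request_conflicting_lock(Client(&quot;client-b&quot;, responsive=False)))
&lt;/pre&gt;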
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported; beginning with 1.8, clients with 2.4 kernels will not be supported either.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also needed to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels listed above.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are [http://www.sun.com/software/products/lustre/get.jsp made available] to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed only for your proprietary product, that effort would be undermined.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/specs.jsp roadmap] features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4230</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4230"/>
		<updated>2008-01-28T04:57:16Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Which Lustre support services are available? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See the Fundamentals section below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
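&lt;br /&gt;
For illustration only, the following Python sketch shows how a logical file offset maps to a stripe object and an offset within that object under RAID-0 striping; the stripe size and count below are made-up example values, not Lustre defaults.&lt;br /&gt;
&lt;pre&gt;
# Illustrative sketch of RAID-0 (striping) address arithmetic.
# The stripe size and stripe count are made-up example values.
STRIPE_SIZE = 1024 * 1024         # 1 MB per stripe chunk (example)
STRIPE_COUNT = 4                  # file striped across 4 objects (example)

def locate(file_offset):
    chunk = file_offset // STRIPE_SIZE        # which stripe-sized chunk of the file
    object_index = chunk % STRIPE_COUNT       # chunks go round-robin over the objects
    object_offset = (chunk // STRIPE_COUNT) * STRIPE_SIZE + file_offset % STRIPE_SIZE
    return object_index, object_offset

print(locate(5 * 1024 * 1024))    # byte 5 MB lives in object 1, at offset 1 MB
&lt;/pre&gt;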
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64 bit clusters, the maximum file size is 2^64.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8TB per stripe, leading to about 1.28PB per file.&lt;br /&gt;
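&lt;br /&gt;
As a quick sanity check of that allocated-space limit, the arithmetic is simply:&lt;br /&gt;
&lt;pre&gt;
# Rough arithmetic behind the per-file allocation limit quoted above.
MAX_STRIPES = 160                  # current per-file stripe limit
STRIPE_LIMIT_BYTES = 8 * 10**12    # about 8 TB per stripe (ext3 on Linux 2.6)

max_file_bytes = MAX_STRIPES * STRIPE_LIMIT_BYTES
print(max_file_bytes / 10**15)     # about 1.28 PB per file
&lt;/pre&gt;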
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
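&lt;br /&gt;
The default inode count is simple arithmetic; as an illustration (assuming the 4 kB bytes-per-inode ratio described above):&lt;br /&gt;
&lt;pre&gt;
# The default MDS inode count is roughly the device size divided by 4 kB.
BYTES_PER_INODE = 4096            # assumed default ratio
device_bytes = 2 * 2**40          # a 2 TB MDS file system

print(device_bytes // BYTES_PER_INODE)   # 536870912, the ~512 million quoted above
&lt;/pre&gt;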
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
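&lt;br /&gt;
As a back-of-the-envelope sizing sketch in Python (all of the throughput and capacity figures are illustrative assumptions, mirroring the gigabit example above):&lt;br /&gt;
&lt;pre&gt;
# Back-of-the-envelope OSS count for a target bandwidth and capacity.
# All of the figures below are illustrative assumptions.
import math

target_bandwidth_mb_s = 10000     # want about 10 GB/s aggregate
target_capacity_tb = 100          # want about 100 TB of space

per_oss_bandwidth_mb_s = 100      # e.g. one single-GigE OSS
per_oss_capacity_tb = 1

needed_for_bandwidth = math.ceil(target_bandwidth_mb_s / per_oss_bandwidth_mb_s)
needed_for_capacity = math.ceil(target_capacity_tb / per_oss_capacity_tb)
print(max(needed_for_bandwidth, needed_for_capacity))   # 100 OSS nodes in this example
&lt;/pre&gt;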
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
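&lt;br /&gt;
To make the aggregation idea concrete, here is a purely illustrative Python sketch -- not the client implementation -- of breaking a large write into 1 MB RPCs while keeping a bounded number in flight per server:&lt;br /&gt;
&lt;pre&gt;
# Illustrative only: chop a write into 1 MB wire RPCs and cap the number
# of RPCs outstanding to any one server. The two helpers are stand-ins
# for the real network layer.
RPC_SIZE = 1024 * 1024            # 1 MB per bulk RPC
MAX_IN_FLIGHT = 8                 # client keeps roughly 5-10 RPCs in flight

def issue_rpc(chunk):             # stand-in: pretend to send one bulk RPC
    return len(chunk)

def wait_for_completion(rpc):     # stand-in: pretend to wait for the reply
    pass

def send_write(data):
    in_flight = []
    for start in range(0, len(data), RPC_SIZE):
        if len(in_flight) == MAX_IN_FLIGHT:
            wait_for_completion(in_flight.pop(0))
        in_flight.append(issue_rpc(data[start:start + RPC_SIZE]))
    for rpc in in_flight:
        wait_for_completion(rpc)

send_write(bytearray(10 * RPC_SIZE))   # a 10 MB write becomes ten 1 MB RPCs
&lt;/pre&gt;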
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM) (2.1.22+), Cisco, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
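At its simplest, the name-to-server mapping can be pictured with a few lines of Python (a conceptual sketch with an invented hash and server count, not the actual Lustre algorithm):&lt;br /&gt;
&lt;pre&gt;
# Conceptual sketch: choose which metadata server handles a name in a
# striped directory. The hash and server count are invented; this is
# not the actual Lustre algorithm.
import zlib

MDS_COUNT = 4                     # hypothetical number of metadata servers

def mds_for_name(name):
    return zlib.crc32(name.encode()) % MDS_COUNT

print(mds_for_name(&#039;checkpoint.0001&#039;))   # always the same server for this name
&lt;/pre&gt;
&lt;br /&gt;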
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
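&lt;br /&gt;
A highly simplified way to picture that choice, in Python pseudocode with invented names (the real protocol carries much more state):&lt;br /&gt;
&lt;pre&gt;
# Toy model of the intent-lock decision described above: the client
# bundles the whole operation with its lock request, and the server
# either executes it immediately or hands back a writeback lock.
# All names here are invented for illustration.

def execute_on_server(intent):    # stand-in for the real MDS handler
    return 0

def grant_lock(intent):           # stand-in for the real lock grant
    return object()

def handle_intent(intent, directory_is_contended):
    if directory_is_contended:
        # High concurrency (e.g. 1,000 clients creating files in one directory):
        # execute the operation on the server and return only the result code.
        return (&#039;executed&#039;, execute_on_server(intent))
    # Single-user case: grant a lock so the client could cache updates in RAM
    # and write them back lazily (not yet used in Lustre 1.x).
    return (&#039;writeback_lock&#039;, grant_lock(intent))

print(handle_intent({&#039;op&#039;: &#039;create&#039;, &#039;name&#039;: &#039;file&#039;}, True))
&lt;/pre&gt;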
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
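&lt;br /&gt;
A minimal sketch of the idea, with invented names and thresholds:&lt;br /&gt;
&lt;pre&gt;
# Toy model of OST object pre-creation: the MDS hands out objects from a
# pre-created pool and asks the OST for more before the pool runs dry.
# Names and thresholds are invented for illustration.
class PrecreatePool:
    def __init__(self, batch=32, low_water=8):
        self.batch = batch
        self.low_water = low_water
        self.pool = []
        self.next_object_id = 1
        self.refill()

    def refill(self):
        # Stands in for asking the OST to pre-create another batch of objects.
        for _ in range(self.batch):
            self.pool.append(self.next_object_id)
            self.next_object_id += 1

    def allocate_stripe(self):
        # File creation consumes a pre-created object: no extra RPC needed.
        obj = self.pool.pop(0)
        if len(self.pool) == self.low_water:
            self.refill()          # replenished asynchronously in reality
        return obj

pool = PrecreatePool()
print([pool.allocate_stripe() for _ in range(3)])   # objects 1, 2, 3
&lt;/pre&gt;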
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
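&lt;br /&gt;
To give a feel for why extents shrink the allocation metadata, here is an illustrative comparison (not the ext3 on-disk format):&lt;br /&gt;
&lt;pre&gt;
# Compare describing a contiguous 1 MB file region as individual 4 kB
# blocks versus as a single extent. Purely illustrative.
BLOCK_SIZE = 4096

def as_block_list(start_block, length_bytes):
    nblocks = length_bytes // BLOCK_SIZE
    return [start_block + i for i in range(nblocks)]    # one pointer per block

def as_extent(start_block, length_bytes):
    return (start_block, length_bytes // BLOCK_SIZE)    # one (start, count) pair

print(len(as_block_list(1000, 1024 * 1024)))   # 256 block pointers
print(as_extent(1000, 1024 * 1024))            # (1000, 256)
&lt;/pre&gt;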
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
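&lt;br /&gt;
The sequence can be sketched as follows (conceptual Python only, with invented names and no error handling):&lt;br /&gt;
&lt;pre&gt;
# Toy model of the conflict callback described above: before dropping a
# lock, a client writes back dirty data and drops cached pages covered
# by that lock. Names are invented for illustration.
class Client:
    def __init__(self):
        self.dirty = []            # cached modifications under the lock
        self.cached_pages = []     # clean cached data under the lock
        self.holds_lock = True

    def on_blocking_callback(self):
        # Another client asked for a conflicting lock.
        self.writeback(self.dirty)         # flush cached modifications
        self.dirty = []
        self.cached_pages = []             # discard data we can no longer cache
        self.holds_lock = False            # only now is the lock dropped

    def writeback(self, pages):            # stand-in for the bulk write RPCs
        pass

c = Client()
c.on_blocking_callback()
print(c.holds_lock)                        # False: lock released after the flush
&lt;/pre&gt;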
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that they are supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash. If a client node hangs or crashes, other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, servers running Linux 2.4 will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of [http://www.sun.com/software/products/lustre/specs.jsp roadmap] features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4229</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4229"/>
		<updated>2008-01-28T04:56:16Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* What is the licensing model for the Lustre file system for Linux? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; under Fundamentals)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
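&lt;br /&gt;
As an illustration of this RAID-0 layout, the sketch below (not Lustre source code; the stripe size and stripe count are arbitrary example values) computes which object, and which offset within that object, holds a given byte of a striped file.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Map a logical file offset to (object index, offset within that object)
 * for a RAID-0 layout: stripe_size-byte stripes laid out round-robin
 * across stripe_count objects. */
static void map_offset(unsigned long long offset,
                       unsigned long long stripe_size,
                       unsigned int stripe_count,
                       unsigned int *obj_idx,
                       unsigned long long *obj_offset)
{
    unsigned long long stripe_no = offset / stripe_size;      /* which stripe overall   */
    *obj_idx = (unsigned int)(stripe_no % stripe_count);      /* which object holds it  */
    *obj_offset = (stripe_no / stripe_count) * stripe_size    /* full stripes before it */
                  + offset % stripe_size;                     /* plus the remainder     */
}

int main(void)
{
    unsigned int idx;
    unsigned long long off;

    /* Example: 1 MB stripes over 4 objects; byte 5,500,000 of the file. */
    map_offset(5500000ULL, 1048576ULL, 4, &amp;idx, &amp;off);
    printf("object %u, offset %llu\n", idx, off);
    return 0;
}
&lt;/pre&gt;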
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the 8 TB maximum partition size on Linux 2.6. Lustre will aggregate multiple OSTs into a single large file system, but each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
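&lt;br /&gt;
For instance, a completely ordinary program like the sketch below (the mount point is a placeholder) behaves the same on a Lustre mount as on a local disk:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    const char *path = "/mnt/lustre/example.txt";  /* hypothetical mount point */
    const char *msg = "hello from a Lustre client\n";
    char buf[64];

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd &lt; 0) { perror("open"); return 1; }

    /* Plain POSIX write, seek, and read -- no Lustre-specific calls. */
    if (write(fd, msg, strlen(msg)) &lt; 0) { perror("write"); return 1; }
    if (lseek(fd, 0, SEEK_SET) &lt; 0) { perror("lseek"); return 1; }

    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n &lt; 0) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("%s", buf);

    close(fd);
    return 0;
}
&lt;/pre&gt;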
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64 bit clusters, the maximum file size is 2^64.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8TB per stripe, leading to about 1.28PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
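&lt;br /&gt;
As a back-of-the-envelope aid to the sizing described above, the sketch below (purely illustrative; real sizing also depends on the network and disk layout) takes a bandwidth target and a capacity target and returns the larger of the two node counts:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

/* Number of OSS nodes needed to meet both an aggregate bandwidth target
 * and a capacity target: the ceiling of each ratio, whichever is larger. */
static unsigned int oss_needed(double target_gbs, double per_oss_gbs,
                               double target_tb, double per_oss_tb)
{
    unsigned int for_bw  = (unsigned int)ceil(target_gbs / per_oss_gbs);
    unsigned int for_cap = (unsigned int)ceil(target_tb / per_oss_tb);
    return for_bw &gt; for_cap ? for_bw : for_cap;
}

int main(void)
{
    /* The example from the answer above: 10 GB/s and 100 TB, using OSS
     * nodes that each provide 0.1 GB/s (single gig-e) and 1 TB of storage. */
    printf("%u OSS nodes\n", oss_needed(10.0, 0.1, 100.0, 1.0));
    return 0;
}
&lt;/pre&gt;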
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM) (2.1.22+), CISCO, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
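&lt;br /&gt;
A minimal sketch of the name-to-server mapping follows; the hash function and the server count are invented for illustration and are not the actual Lustre algorithm. The only requirement is that every client computes the same answer for the same name.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Toy mapping of a name to one of n_mds metadata servers for a striped
 * directory.  The hash here is the well-known djb2 string hash, chosen
 * only for brevity -- not the hash Lustre uses. */
static unsigned int mds_for_name(const char *name, unsigned int n_mds)
{
    unsigned long h = 5381;
    for (; *name; name++)
        h = h * 33 + (unsigned char)*name;
    return (unsigned int)(h % n_mds);
}

int main(void)
{
    const char *names[] = { "checkpoint.0001", "checkpoint.0002", "results.dat" };
    for (int i = 0; i &lt; 3; i++)
        printf("%s -&gt; MDS %u\n", names[i], mds_for_name(names[i], 4));
    return 0;
}
&lt;/pre&gt;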
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
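&lt;br /&gt;
The toy sketch below (not Lustre code; the structure names and the contention test are invented for illustration) captures the decision the server gets to make: because the whole operation arrives with the lock request, the server can either execute it and return only a result, or grant a lock so the client can cache the update itself.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* The client bundles the whole metadata operation -- the "intent" --
 * with its lock request. */
struct intent {
    const char *op;        /* e.g. "create" */
    const char *name;      /* name being created or looked up */
};

struct reply {
    int executed;          /* 1: server performed the op, only a result returns */
    int lock_granted;      /* 1: client may cache and write back later          */
};

/* Server-side policy: under high concurrency, execute on the server in a
 * single RPC; for an uncontended directory, hand back a lock instead. */
static struct reply handle_intent(const struct intent *it, int users_of_dir)
{
    struct reply r = { 0, 0 };
    (void)it;                        /* a real server would act on the intent */
    if (users_of_dir &gt; 1)
        r.executed = 1;              /* e.g. 1,000 clients creating in one dir */
    else
        r.lock_granted = 1;          /* e.g. a single-user home directory      */
    return r;
}

int main(void)
{
    struct intent it = { "create", "output.0042" };
    struct reply shared = handle_intent(&amp;it, 1000);
    struct reply single = handle_intent(&amp;it, 1);
    printf("shared dir:  executed=%d lock_granted=%d\n", shared.executed, shared.lock_granted);
    printf("private dir: executed=%d lock_granted=%d\n", single.executed, single.lock_granted);
    return 0;
}
&lt;/pre&gt;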
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
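&lt;br /&gt;
One way to picture the finer-grained locking is as separate lock bits covering different parts of the same inode, so that a client caching the mere existence of &amp;quot;dir&amp;quot; no longer conflicts with the server updating the contents of &amp;quot;dir&amp;quot;. The bit names in this sketch are illustrative only, not the actual Lustre identifiers.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;stdio.h&gt;

/* Illustrative lock bits for different parts of an inode.  A lookup that
 * only caches the existence of a name takes LOCK_LOOKUP; a create inside
 * the directory takes LOCK_UPDATE; the two no longer conflict. */
enum inode_lock_bits {
    LOCK_LOOKUP = 1 &lt;&lt; 0,   /* the name exists                 */
    LOCK_UPDATE = 1 &lt;&lt; 1,   /* directory contents are changing */
    LOCK_ATTR   = 1 &lt;&lt; 2,   /* attributes (size, times)        */
};

static int conflicts(unsigned int held, unsigned int wanted)
{
    return (held &amp; wanted) != 0;
}

int main(void)
{
    unsigned int client_holds = LOCK_LOOKUP;   /* client cached: "dir" exists */
    unsigned int server_wants = LOCK_UPDATE;   /* server creates "dir/file"   */
    printf("conflict: %s\n", conflicts(client_holds, server_wants) ? "yes" : "no");
    return 0;
}
&lt;/pre&gt;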
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
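&lt;br /&gt;
A typical Linux direct I/O open looks like the sketch below. O_DIRECT generally requires the buffer, offset, and transfer size to be suitably aligned, which is why posix_memalign is used; the path and sizes are placeholders.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    const char *path = "/mnt/lustre/direct.dat";   /* placeholder mount point */
    void *buf = NULL;
    size_t len = 1 &lt;&lt; 20;                          /* 1 MB transfer */

    /* O_DIRECT wants aligned memory; 4096 matches the usual block size. */
    if (posix_memalign(&amp;buf, 4096, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xab, len);

    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd &lt; 0) { perror("open"); return 1; }

    /* The write bypasses the client page cache, but Lustre still locks the
     * extent so the result is coherent across the cluster. */
    if (write(fd, buf, len) &lt; 0) { perror("write"); return 1; }

    close(fd);
    free(buf);
    return 0;
}
&lt;/pre&gt;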
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
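&lt;br /&gt;
With ROMIO-based MPI-IO implementations, such hints are normally passed through an MPI_Info object. The hint names below (striping_factor and striping_unit) are the conventional ROMIO ones; whether a given MPI stack honors them for Lustre depends on that stack, so treat this only as a sketch.&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
#include &lt;mpi.h&gt;

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&amp;argc, &amp;argv);
    MPI_Info_create(&amp;info);

    /* Ask for 8 stripes of 1 MB each for the new output file. */
    MPI_Info_set(info, "striping_factor", "8");
    MPI_Info_set(info, "striping_unit", "1048576");

    MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &amp;fh);

    /* ... collective or independent writes would go here ... */

    MPI_File_close(&amp;fh);
    MPI_Info_free(&amp;info);
    MPI_Finalize();
    return 0;
}
&lt;/pre&gt;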
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, a node is more likely to hang or time out than to crash outright. If a client node hangs or crashes, the other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. A few extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4228</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4228"/>
		<updated>2008-01-28T04:51:23Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Which operating systems are/will be supported? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
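&lt;br /&gt;
As a rough sketch only (not Lustre code; the stripe size and count below are hypothetical), RAID-0 striping maps a file offset to an object and an offset within that object like this:&lt;br /&gt;
&lt;br /&gt;
 STRIPE_SIZE = 1 &lt;&lt; 20    # 1 MB stripes (hypothetical; configurable per file)&lt;br /&gt;
 STRIPE_COUNT = 4          # file striped across 4 objects/OSTs (hypothetical)&lt;br /&gt;
 &lt;br /&gt;
 def locate(file_offset):&lt;br /&gt;
     stripe_number = file_offset // STRIPE_SIZE     # which stripe of the file&lt;br /&gt;
     object_index = stripe_number % STRIPE_COUNT    # RAID-0: round-robin across objects&lt;br /&gt;
     object_offset = (stripe_number // STRIPE_COUNT) * STRIPE_SIZE + file_offset % STRIPE_SIZE&lt;br /&gt;
     return object_index, object_offset&lt;br /&gt;
 &lt;br /&gt;
 # Byte 5 MB of the file lives in object 1, at offset 1 MB within that object.&lt;br /&gt;
 print(locate(5 * (1 &lt;&lt; 20)))&lt;br /&gt;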
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried; hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
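&lt;br /&gt;
Purely as a back-of-the-envelope check of the figures above (decimal units, illustrative only):&lt;br /&gt;
&lt;br /&gt;
 TB = 10 ** 12&lt;br /&gt;
 PB = 10 ** 15&lt;br /&gt;
 &lt;br /&gt;
 max_stripe_size = 8 * TB   # roughly the per-OST ext3 limit on Linux 2.6&lt;br /&gt;
 max_stripes = 160          # current per-file stripe limit&lt;br /&gt;
 &lt;br /&gt;
 # Largest allocated single file: about 160 x 8 TB = 1.28 PB.&lt;br /&gt;
 print(max_stripes * max_stripe_size / PB)&lt;br /&gt;
 &lt;br /&gt;
 # Aggregate file system with roughly 4,000 OSTs of 8 TB each: about 32 PB.&lt;br /&gt;
 print(4000 * max_stripe_size / PB)&lt;br /&gt;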
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
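&lt;br /&gt;
The rule of thumb above boils down to a trivial calculation; the figures are illustrative, not a recommendation:&lt;br /&gt;
&lt;br /&gt;
 def oss_count(required_mb_per_sec, per_oss_mb_per_sec):&lt;br /&gt;
     # Round up: each OSS contributes its share of the aggregate throughput.&lt;br /&gt;
     return -(-required_mb_per_sec // per_oss_mb_per_sec)&lt;br /&gt;
 &lt;br /&gt;
 # 10 GB/s aggregate from gigabit-ethernet OSSs at ~100 MB/s each: 100 nodes.&lt;br /&gt;
 print(oss_count(10000, 100))&lt;br /&gt;
 &lt;br /&gt;
 # The same target from heavy-duty OSSs at ~2,500 MB/s each: 4 nodes.&lt;br /&gt;
 print(oss_count(10000, 2500))&lt;br /&gt;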
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
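&lt;br /&gt;
A minimal sketch of these sizing rules (the ~4 kB of MDS storage per file mentioned above, and the default inode limit of roughly device size / 4 kB from the Sizing section); the file counts are only examples:&lt;br /&gt;
&lt;br /&gt;
 KB, GB, TB = 2 ** 10, 2 ** 30, 2 ** 40&lt;br /&gt;
 &lt;br /&gt;
 def mds_storage_needed(num_files):&lt;br /&gt;
     # Rule of thumb: roughly 4 kB of MDS storage per file.&lt;br /&gt;
     return num_files * 4 * KB&lt;br /&gt;
 &lt;br /&gt;
 def default_inode_limit(mds_device_size):&lt;br /&gt;
     # Default formatting allows slightly less than one inode per 4 kB of device.&lt;br /&gt;
     return mds_device_size // (4 * KB)&lt;br /&gt;
 &lt;br /&gt;
 print(mds_storage_needed(20 * 10 ** 6) / GB)   # ~20 million files: roughly 76 GB&lt;br /&gt;
 print(default_inode_limit(2 * TB))             # about 512 million inodes on a 2 TB MDS&lt;br /&gt;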
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM) (2.1.22+), Cisco, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
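&lt;br /&gt;
At a very high level, and purely as an illustration (the real implementation must also handle recovery, rebalancing and locking), the name-to-server mapping could look like this; the server count and hash are hypothetical:&lt;br /&gt;
&lt;br /&gt;
 import hashlib&lt;br /&gt;
 &lt;br /&gt;
 NUM_MDS = 8   # hypothetical number of metadata servers striping one directory&lt;br /&gt;
 &lt;br /&gt;
 def mds_for_name(parent_dir_id, name):&lt;br /&gt;
     # Hash the (directory, name) pair so that every client independently&lt;br /&gt;
     # picks the same metadata server for the same name.&lt;br /&gt;
     key = (str(parent_dir_id) + &#039;/&#039; + name).encode()&lt;br /&gt;
     return hashlib.md5(key).digest()[0] % NUM_MDS&lt;br /&gt;
 &lt;br /&gt;
 print(mds_for_name(42, &#039;output.0001&#039;))&lt;br /&gt;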
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
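&lt;br /&gt;
A toy sketch of the decision described above (hypothetical names; as noted, today&#039;s Lustre 1.x metadata server always takes the first branch):&lt;br /&gt;
&lt;br /&gt;
 def handle_lock_request_with_intent(mds, client, intent):&lt;br /&gt;
     # The client bundles the whole operation (e.g. create a file) with its lock request.&lt;br /&gt;
     if mds.directory_is_highly_contended(intent.parent):&lt;br /&gt;
         # High concurrency: execute the operation on the server and return only&lt;br /&gt;
         # the result, so the whole metadata update costs a single RPC.&lt;br /&gt;
         return mds.execute(intent)&lt;br /&gt;
     # Low concurrency: grant a lock so the client could cache and write back&lt;br /&gt;
     # metadata updates itself (a client writeback cache is planned, not in 1.x).&lt;br /&gt;
     return mds.grant_lock(intent.parent, client)&lt;br /&gt;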
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor made up of extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s; 1 GB/s on Woodcrest&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the [http://www.sun.com/software/products/lustre/specs.jsp roadmap], these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
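&lt;br /&gt;
As a small illustration (the mount point and file name below are examples only), direct I/O can be exercised from the shell with the dd utility, which supports the iflag=direct and oflag=direct options:&lt;br /&gt;
&lt;pre&gt;
# Write 64 MB to a Lustre file, bypassing the client page cache
dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=64 oflag=direct
# Read it back with direct I/O as well
dd if=/mnt/lustre/testfile of=/dev/null bs=1M iflag=direct
&lt;/pre&gt;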
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash outright. If a client node hangs or crashes, the other client and server nodes are usually not affected; normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, causing only a short delay to applications which try to use that node. Other server nodes and clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
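&lt;br /&gt;
As a rough sketch only (the device name is an example, and the exact procedure depends on the Lustre version), the per-device repair uses the standard ext3 tools:&lt;br /&gt;
&lt;pre&gt;
# On each affected server, with Lustre stopped and the backend device
# unmounted, check and repair the ext3 file system:
e2fsck -f -y /dev/sdb1
# Once every backend device is clean, the Lustre-level coherency checker
# (lfsck) can be run across servers; see the Lustre manual for its options.
&lt;/pre&gt;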
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. &lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, servers running Linux 2.4 will no longer be supported, and beginning with Lustre 1.8, clients running 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4227</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4227"/>
		<updated>2008-01-28T04:47:37Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Do you plan to support OSS failover without shared storage? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
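&lt;br /&gt;
For illustration, striping can be inspected and requested with the lfs utility; the paths below are examples only, and the setstripe option syntax differs between Lustre releases:&lt;br /&gt;
&lt;pre&gt;
# Show the stripe count, stripe size and objects of an existing file
lfs getstripe /mnt/lustre/somefile
# Ask for new files created in this directory to be striped over 4 OSTs
# (option names vary between Lustre releases; see lfs help setstripe)
lfs setstripe -c 4 /mnt/lustre/newdir
&lt;/pre&gt;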
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
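&lt;br /&gt;
As a rough sketch only (assuming the Lustre 1.6-style mkfs.lustre and mount utilities; the file system name, NID and device below are examples), adding an OST to a running file system looks like this:&lt;br /&gt;
&lt;pre&gt;
# On the new or existing OSS, format an additional OST ...
mkfs.lustre --ost --fsname=testfs --mgsnode=192.168.0.10@tcp0 /dev/sdc
# ... and start it; it joins the running file system when mounted
mount -t lustre /dev/sdc /mnt/ost2
&lt;/pre&gt;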
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64 bit clusters, the maximum file size is 2^64.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8TB per stripe, leading to about 1.28PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
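&lt;br /&gt;
As an illustrative sketch only (assuming the Lustre 1.6-style mkfs.lustre utility; the device name and values are examples), mke2fs options can be passed through at format time to raise the inode count:&lt;br /&gt;
&lt;pre&gt;
# A smaller bytes-per-inode ratio creates more inodes on the MDS device;
# the -i value here is only an example (see the mke2fs man page):
mkfs.lustre --mdt --mgs --fsname=testfs --mkfsoptions=&amp;quot;-i 2048&amp;quot; /dev/sdb
&lt;/pre&gt;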
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
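&lt;br /&gt;
As an illustration (the /proc path shown is from the 1.x client and may differ between versions), the number of in-flight requests per server can be inspected on a client:&lt;br /&gt;
&lt;pre&gt;
# Show the current limit on concurrent RPCs for each object storage client
# (OSC) device; a larger value can be written to the same file if a
# workload benefits from deeper request queues.
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight
&lt;/pre&gt;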
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA and exotic storage (NVRAM) are supported. &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
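&lt;br /&gt;
For illustration (device names are examples; with Lustre these options would normally be passed through the Lustre formatting tools), ext3 supports an external journal device as follows:&lt;br /&gt;
&lt;pre&gt;
# Create a dedicated external journal device ...
mke2fs -O journal_dev -b 4096 /dev/sdc1
# ... and create the backend file system pointing at that journal
mke2fs -j -J device=/dev/sdc1 -b 4096 /dev/sdb1
&lt;/pre&gt;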
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
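&lt;br /&gt;
As an example sketch (device names are illustrative), a software RAID backend can be built with mdadm and then treated like any other block device:&lt;br /&gt;
&lt;pre&gt;
# Assemble four disks into a software RAID-5 array ...
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# ... then format and use /dev/md0 as an ordinary Lustre backend device
&lt;/pre&gt;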
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
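&lt;br /&gt;
For illustration of steps 3 and 4, here is a minimal sketch assuming the mount-based configuration introduced with Lustre 1.6 (host names, NIDs, devices and the file system name are examples only); 1.4-series releases instead use the configuration-file tools mentioned in step 2:&lt;br /&gt;
&lt;pre&gt;
# Step 3, on the MDS:
mkfs.lustre --mdt --mgs --fsname=testfs /dev/sda
mount -t lustre /dev/sda /mnt/mdt
# Step 3, on each OSS (pdsh can run these commands on many nodes at once):
mkfs.lustre --ost --fsname=testfs --mgsnode=192.168.0.10@tcp0 /dev/sdb
mount -t lustre /dev/sdb /mnt/ost0
# Step 4, on each client:
mount -t lustre 192.168.0.10@tcp0:/testfs /mnt/lustre
&lt;/pre&gt;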
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
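&lt;br /&gt;
For illustration (interface names are examples, and this assumes the LNET module options syntax used by recent Lustre releases; older releases configure networks differently), the interconnect to use is selected in /etc/modprobe.conf:&lt;br /&gt;
&lt;pre&gt;
# Use TCP/IP over the first ethernet interface
options lnet networks=tcp0(eth0)
# A node with both ethernet and InfiniBand might instead use:
# options lnet networks=tcp0(eth0),o2ib0(ib0)
&lt;/pre&gt;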
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. These appear on the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] as the RAID-1 and RAID-5 file I/O features, which will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
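&lt;br /&gt;
As a rough illustration of the above (a hedged sketch only: the device name and mount point are hypothetical, and the exact on-disk directory layout varies between Lustre versions), an administrator can inspect a stopped OST&#039;s backend directly:&lt;br /&gt;
&lt;pre&gt;
# Mount the OST&#039;s backing ext3 file system read-only (with the OST service stopped)
mount -o ro -t ext3 /dev/sdb1 /mnt/ost-backend

# Objects appear as plainly numbered files; there is no pathname information here
find /mnt/ost-backend -type f | head

umount /mnt/ost-backend
&lt;/pre&gt;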
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
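&lt;br /&gt;
As an illustrative sketch only (the name and location of this tunable is an assumption based on 1.4-era releases and may differ in your version), the cluster-wide timeout can be inspected and raised at runtime:&lt;br /&gt;
&lt;pre&gt;
# Show the current Lustre timeout, in seconds
cat /proc/sys/lustre/timeout

# Raise it, for example on very large or heavily loaded clusters
echo 300 &gt; /proc/sys/lustre/timeout
&lt;/pre&gt;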
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
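&lt;br /&gt;
For example (a minimal sketch; the file path is hypothetical, and oflag/iflag=direct require a reasonably recent GNU dd), direct I/O on a Lustre file can be exercised with standard tools:&lt;br /&gt;
&lt;pre&gt;
# Write 1 GB using O_DIRECT, bypassing the client page cache
dd if=/dev/zero of=/mnt/lustre/dio-test bs=1M count=1024 oflag=direct

# Read it back with O_DIRECT as well
dd if=/mnt/lustre/dio-test of=/dev/null bs=1M iflag=direct
&lt;/pre&gt;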
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
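&lt;br /&gt;
Even without MPI/IO hints, striping can be chosen per file or per directory with the lfs utility. A minimal sketch (the path and values are examples only, and the setstripe argument syntax differs between Lustre releases; the positional form shown is the 1.4-era one):&lt;br /&gt;
&lt;pre&gt;
# lfs setstripe &lt;path&gt; &lt;stripe-size&gt; &lt;starting-ost&gt; &lt;stripe-count&gt;
# New files in this directory are striped across 4 OSTs in 1 MB chunks;
# -1 lets Lustre choose the starting OST
lfs setstripe /mnt/lustre/output 1048576 -1 4
&lt;/pre&gt;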
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash outright. If a client node hangs or crashes, the other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
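&lt;br /&gt;
A rough sketch of what such a repair involves (device names are hypothetical, and the exact procedure and options depend on the Lustre and e2fsprogs versions in use):&lt;br /&gt;
&lt;pre&gt;
# With the services stopped, repair each backend ext3 file system
e2fsck -f -p /dev/mdt-device
e2fsck -f -p /dev/ost-device

# Lustre&#039;s distributed checker (lfsck) can then verify MDS/OST coherency,
# e.g. that each file&#039;s objects exist and that no objects are orphaned
&lt;/pre&gt;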
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will no longer be supported either.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
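&lt;br /&gt;
As an illustrative sketch of the simple, non-failover case described above (package names and the mount point are examples only):&lt;br /&gt;
&lt;pre&gt;
umount /mnt/lustre        # or stop the MDS/OSS service on a server node
rpm -Uvh lustre-*.rpm     # install the updated Lustre packages
mount /mnt/lustre         # remount (assuming an /etc/fstab entry), or restart the service
&lt;/pre&gt;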
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4226</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4226"/>
		<updated>2008-01-28T04:45:52Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* How do I automate failover of my OSSs? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
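&lt;br /&gt;
To see how a given file&#039;s data is spread across objects, the lfs utility can be used from any client; a minimal sketch (the path is hypothetical):&lt;br /&gt;
&lt;pre&gt;
# Show the stripe count, stripe size, and the OST objects backing this file
lfs getstripe /mnt/lustre/data/output.dat
&lt;/pre&gt;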
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
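&lt;br /&gt;
A rough sketch of adding an OST with the Lustre 1.6 utilities (the file system name, MGS node, device, and mount point are examples only):&lt;br /&gt;
&lt;pre&gt;
# On the new or existing OSS: format an additional OST for file system &amp;quot;testfs&amp;quot;
mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 /dev/sdc

# Start it; the MDS and clients pick up the new target without a remount
mount -t lustre /dev/sdc /mnt/testfs-ost2
&lt;/pre&gt;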
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be raised to match a larger PAGE_SIZE (on IA-64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
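&lt;br /&gt;
For illustration only (values are examples, and in a Lustre installation these options are normally passed through Lustre&#039;s own formatting tools rather than run by hand), the inode count is controlled by standard mke2fs options at format time:&lt;br /&gt;
&lt;pre&gt;
# One inode per 2 kB of device space instead of the default of roughly one per 4 kB
mke2fs -j -i 2048 /dev/mds-device

# Or request an absolute number of inodes
mke2fs -j -N 300000000 /dev/mds-device
&lt;/pre&gt;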
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
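&lt;br /&gt;
These per-server values are visible, and adjustable, on the client through /proc. A hedged sketch -- the exact file names and locations vary between Lustre versions:&lt;br /&gt;
&lt;pre&gt;
# One osc directory per client-side connection to an OST
cat /proc/fs/lustre/osc/*/max_pages_per_rpc    # RPC size in pages (256 x 4 kB = 1 MB)
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight   # concurrent RPCs per server

# Allow more concurrent RPCs to one OST
echo 16 &gt; /proc/fs/lustre/osc/&lt;target&gt;/max_rpcs_in_flight
&lt;/pre&gt;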
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
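&lt;br /&gt;
A minimal sketch of a Samba share exporting a Lustre client mount (the share name, path, and settings are examples only; whether to disable oplocks depends on your clients and on how many export nodes you run):&lt;br /&gt;
&lt;pre&gt;
# /etc/samba/smb.conf on the Lustre client acting as a CIFS gateway
[lustre]
    path = /mnt/lustre
    read only = no
    # With more than one Samba gateway, oplocks and share modes are not
    # coordinated between servers, so they may need to be disabled:
    # oplocks = no
    # level2 oplocks = no
&lt;/pre&gt;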
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
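&lt;br /&gt;
For illustration (device names are hypothetical, and in a Lustre installation the equivalent options are normally passed through Lustre&#039;s formatting tools), ext3 supports an external journal device like this:&lt;br /&gt;
&lt;pre&gt;
# Create a dedicated journal device, then an ext3 file system that uses it
mke2fs -O journal_dev /dev/sde1
mke2fs -j -J device=/dev/sde1 /dev/sdb1
&lt;/pre&gt;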
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
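&lt;br /&gt;
As a rough sketch of steps 3 and 4 above for a small test system (host names, devices, and the file system name are examples; the exact commands differ between the 1.4 and 1.6 tool sets, and the 1.6-style utilities are shown here):&lt;br /&gt;
&lt;pre&gt;
# On the MDS node (combined MGS/MDT)
mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda2
mount -t lustre /dev/sda2 /mnt/testfs-mdt

# On each OSS node
mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 /dev/sdb1
mount -t lustre /dev/sdb1 /mnt/testfs-ost1

# On each client
mount -t lustre mds1@tcp0:/testfs /mnt/lustre
&lt;/pre&gt;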
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel on a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port InfiniBand on a 64-bit OSS: 700-900 MB/s&lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery] .&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. These are the RAID-1 and RAID-5 file I/O features on the roadmap; they will provide redundancy and recoverability in the Lustre object protocol itself, rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
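&lt;br /&gt;
A minimal sketch of the client-side half of this protocol, with invented classes and names (this is not the Lustre DLM code): on a blocking callback the client first writes back its cached modifications, then purges everything the lock covered, and only then releases the lock.&lt;br /&gt;
&lt;pre&gt;
# Illustrative only; invented classes, not the Lustre DLM implementation.
class FakeOST:
    def __init__(self):
        self.stored = {}
    def write(self, offset, data):
        self.stored[offset] = data

class ClientExtentLock:
    def __init__(self, ost, extent):
        self.ost = ost
        self.extent = extent    # byte range covered by this lock
        self.dirty = {}         # offset: data modified under the lock
        self.cached = {}        # clean data cached under the lock
        self.held = True

    def blocking_callback(self):
        # Another client asked for a conflicting lock.
        for offset, data in self.dirty.items():
            self.ost.write(offset, data)   # 1. write back modifications
        self.dirty.clear()
        self.cached.clear()                # 2. drop all covered cache data
        self.held = False                  # 3. only then drop the lock

lock = ClientExtentLock(FakeOST(), (0, 1048576))
lock.dirty[0] = b&#039;checkpoint data&#039;
lock.blocking_callback()
print(lock.held, lock.ost.stored)
&lt;/pre&gt;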
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Failures are becoming rarer, and a node is more likely to hang or time out than to crash outright. If a client node hangs or crashes, other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes and clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tools will reliably fix any damage they can. The repair runs in parallel on all nodes, but can still be very time-consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will no longer be supported either.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also needed to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4225</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4225"/>
		<updated>2008-01-28T04:44:53Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* What is a typical OSS node configuration? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See the Fundamentals section below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
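&lt;br /&gt;
The RAID-0 layout amounts to a little arithmetic. The stripe size, stripe count, and function below are example values chosen for this sketch, not Lustre defaults or source code:&lt;br /&gt;
&lt;pre&gt;
# Illustrative RAID-0 striping arithmetic; not Lustre source code.
STRIPE_SIZE  = 1048576        # 1 MB per stripe (example value)
STRIPE_COUNT = 4              # file striped over 4 objects

def locate(file_offset):
    stripe_number = file_offset // STRIPE_SIZE
    object_index  = stripe_number % STRIPE_COUNT        # which object/OST
    object_offset = ((stripe_number // STRIPE_COUNT) * STRIPE_SIZE
                     + file_offset % STRIPE_SIZE)       # offset inside it
    return object_index, object_offset

print(locate(0))                    # (0, 0): first stripe, first object
print(locate(5 * 1048576 + 10))     # sixth stripe lands on object 1
&lt;/pre&gt;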
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce the distinction consistently.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, each individual OST partition is limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production file systems of 1.4 PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of RAM, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
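&lt;br /&gt;
The arithmetic behind those two configurations can be captured in a small helper; the function name and ceiling-division trick are ours, and the figures are simply the ones from the example above:&lt;br /&gt;
&lt;pre&gt;
# Sizing sketch: OSS count needed for a target capacity and bandwidth.
def oss_count(total_tb, total_mb_per_s, per_oss_tb, per_oss_mb_per_s):
    by_capacity  = -(-total_tb // per_oss_tb)              # ceiling division
    by_bandwidth = -(-total_mb_per_s // per_oss_mb_per_s)
    return max(by_capacity, by_bandwidth)

# 100 TB at 10 GB/s from GigE OSSs with 1 TB of 100 MB/s storage each:
print(oss_count(100, 10000, 1, 100))      # 100 nodes
# The same from heavy-duty OSSs with 25 TB and ~2.5 GB/s each:
print(oss_count(100, 10000, 25, 2500))    # 4 nodes
&lt;/pre&gt;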
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
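&lt;br /&gt;
A toy version of that aggregation logic, with invented names and deliberately simplified flow control (the real client is asynchronous and considerably more subtle):&lt;br /&gt;
&lt;pre&gt;
# Illustrative client-side write aggregation; not Lustre code.
RPC_SIZE      = 1048576    # aggregate dirty data into ~1 MB wire requests
MAX_IN_FLIGHT = 8          # keep several requests outstanding per server

def build_rpcs(dirty_bytes):
    # Split a large buffered write into full-sized RPCs plus a tail.
    rpcs, offset = [], 0
    while offset &lt; dirty_bytes:
        size = min(RPC_SIZE, dirty_bytes - offset)
        rpcs.append((offset, size))
        offset += size
    return rpcs

def send(rpcs):
    in_flight = []
    for rpc in rpcs:
        if len(in_flight) == MAX_IN_FLIGHT:
            in_flight.pop(0)       # pretend the oldest request completed
        in_flight.append(rpc)      # keep the pipeline full

rpcs = build_rpcs(10 * 1048576 + 4096)
print(len(rpcs))                   # 11 requests: ten of 1 MB, one 4 kB tail
send(rpcs)
&lt;/pre&gt;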
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet) (2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
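&lt;br /&gt;
The name-to-server mapping can be pictured like this; the hash function and server list are stand-ins for illustration, not the algorithm the clustered metadata code will actually use:&lt;br /&gt;
&lt;pre&gt;
# Illustrative only; not Lustre&#039;s real directory hash or layout.
import zlib

MDS_NODES = [&#039;mds0&#039;, &#039;mds1&#039;, &#039;mds2&#039;, &#039;mds3&#039;]   # servers striping a directory

def mds_for_name(name):
    # Every client applies the same hash, so lookups and creates for a
    # given name always land on the same metadata server.
    h = zlib.crc32(name.encode())
    return MDS_NODES[h % len(MDS_NODES)]

print(mds_for_name(&#039;checkpoint.0001&#039;))
print(mds_for_name(&#039;checkpoint.0002&#039;))
&lt;/pre&gt;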
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
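&lt;br /&gt;
As a rough picture of the kind of information that striping EA carries -- the field names and layout here are invented, not the real ext3/Lustre on-disk format:&lt;br /&gt;
&lt;pre&gt;
# Hypothetical sketch of the striping EA contents; illustrative only.
from dataclasses import dataclass, field

@dataclass
class StripeEntry:
    ost_index: int        # which OST holds this object
    object_id: int        # object number on that OST (e.g. 934151)

@dataclass
class StripingEA:
    stripe_size: int                               # bytes per stripe
    entries: list = field(default_factory=list)    # one entry per object

ea = StripingEA(stripe_size=1048576,
                entries=[StripeEntry(0, 934151), StripeEntry(1, 721834)])
print(len(ea.entries), &#039;stripes of&#039;, ea.stripe_size, &#039;bytes each&#039;)
&lt;/pre&gt;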
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. These appear on the roadmap as the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
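&lt;br /&gt;
As a sketch of the inspection mentioned above (the device and mount point are hypothetical, and the OST service should be stopped first):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# With the OST service stopped, its backing ext3 file system can be mounted
# read-only and examined directly.
mount -t ext3 -o ro /dev/sdc1 /mnt/ost0-inspect
ls /mnt/ost0-inspect      # object-numbered files, not a directory namespace
umount /mnt/ost0-inspect
&lt;/pre&gt;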
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
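&lt;br /&gt;
As a quick way to exercise the direct I/O path (the mount point and file name are hypothetical, and GNU dd is assumed), the standard O_DIRECT flag can be requested via dd:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Write 1 GB with O_DIRECT, then read it back with O_DIRECT.
dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M count=1024 oflag=direct
dd if=/mnt/lustre/ddtest of=/dev/null bs=1M iflag=direct
&lt;/pre&gt;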
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
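&lt;br /&gt;
As a rough sketch only, assuming Heartbeat version 1 and a setup in which starting the MDS service amounts to mounting its shared target (node names and devices are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# /etc/ha.d/haresources (Heartbeat v1): mds1 normally owns the MDS target;
# on failure, mds2 mounts the shared device and takes over.
mds1 Filesystem::/dev/shared-mdt::/mnt/mdt::lustre
# A STONITH device is configured separately in ha.cf so that the surviving
# node can power off its failed partner before taking over the storage.
&lt;/pre&gt;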
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, a node is more likely to hang or time out than to crash outright. If a client node hangs or crashes, the other client and server nodes are usually unaffected; normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes and clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time-consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and with 1.8 clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but which are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
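&lt;br /&gt;
As a minimal sketch of the client-side sequence (the package names and mount device string are hypothetical and depend on your installation):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Upgrade a client node without rebooting (no kernel change assumed).
umount /mnt/lustre
rpm -Uvh lustre-*.rpm lustre-modules-*.rpm
mount -t lustre mds1:/mds/client /mnt/lustre   # same device string used at install time
&lt;/pre&gt;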
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4224</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4224"/>
		<updated>2008-01-28T04:44:01Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* How many clients can each OSS support? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (for the difference, see &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
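&lt;br /&gt;
Striping can also be examined or set per file or per directory with the lfs utility. The option-style flags below (-s stripe size, -c stripe count, -i starting OST index) are those of later 1.x releases and are an assumption here, as is the mount point:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Files created under this directory inherit 4 stripes of 1 MB; Lustre picks the starting OST.
lfs setstripe -s 1M -c 4 -i -1 /mnt/lustre/results
# Show which OSTs hold the objects of an existing file.
lfs getstripe /mnt/lustre/results/output.dat
&lt;/pre&gt;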
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit on allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
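&lt;br /&gt;
On 1.x clients the in-flight request limit is exposed as a tunable under /proc; the exact path below is an assumption and should be checked against your release:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Assumed /proc layout; shows the per-OST limit on concurrent RPCs (typically around 8).
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight
# The same files accept a new value if the limit needs to be raised or lowered.
&lt;/pre&gt;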
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
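&lt;br /&gt;
As a sketch of a re-export node (paths, networks, and the share name are hypothetical; remember that NFS export is slow and unsupported, and that share-mode/oplock users should stay on a single Samba server):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# /etc/exports on the re-exporting Lustre client
/mnt/lustre  192.168.1.0/24(rw,sync,no_subtree_check)

# smb.conf share section on the same node
[lustre]
   path = /mnt/lustre
   read only = no
&lt;/pre&gt;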
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, so RAID-1 mirrored storage is recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
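&lt;br /&gt;
The options involved are standard e2fsprogs ones; the Lustre formatting step would normally pass them for you, and the device names here are hypothetical:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# Create an external journal device, then an ext3 file system that uses it.
mke2fs -O journal_dev /dev/sdb1
mke2fs -j -J device=/dev/sdb1 /dev/sdc1
&lt;/pre&gt;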
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
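&lt;br /&gt;
For example (device and volume names hypothetical), a software RAID set or an LVM logical volume built with the standard tools can serve as the backend device:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# A 4-disk software RAID-5 set...
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# ...or an LVM logical volume in an existing volume group.
lvcreate -L 2000G -n ost0 vg_lustre
&lt;/pre&gt;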
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
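&lt;br /&gt;
For comparison, here is a minimal sketch of the Lustre 1.6-style flow, which replaces the configuration-file step described above (the file system name, node names, and devices are hypothetical):&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# On the MDS node (here also acting as the management server):
mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1
mount -t lustre /dev/sda1 /mnt/mdt

# On each OSS node (pdsh can run these across many nodes in parallel):
mkfs.lustre --fsname=testfs --ost --mgsnode=mds1@tcp0 /dev/sdb1
mount -t lustre /dev/sdb1 /mnt/ost0

# On each client:
mount -t lustre mds1@tcp0:/testfs /mnt/lustre
&lt;/pre&gt;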
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet) (2.1.22+), Cisco, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
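&lt;br /&gt;
With the LNET-based releases (1.4.6 and later), the available networks are usually declared through module options; the file location and exact syntax below are assumptions that depend on the release:&lt;br /&gt;
&lt;br /&gt;
&lt;pre&gt;
# /etc/modprobe.conf fragment (assumed LNET-era syntax): this node speaks TCP on eth0
# and Elan on its Quadrics rail, and Lustre picks the appropriate network per peer.
options lnet networks=tcp0(eth0),elan0
&lt;/pre&gt;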
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
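&lt;br /&gt;
As a hedged illustration only (the Lustre configuration tools normally handle formatting, exact options vary by release, and the device name here is hypothetical), 256-byte inodes are an ordinary e2fsprogs option:&lt;br /&gt;
&lt;br /&gt;
 # -j requests an ext3 journal; -I 256 selects 256-byte inodes so the striping EA fits in the inode&lt;br /&gt;
 mke2fs -j -I 256 /dev/sda1&lt;br /&gt;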
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not made up of extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Sizing Sizing].&lt;br /&gt;
&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
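&lt;br /&gt;
For example (a hedged sketch; the device and mount point are hypothetical, and this should only be done while the OST is stopped), the backend file system can be mounted read-only and browsed like any other ext3 volume:&lt;br /&gt;
&lt;br /&gt;
 # mount the OST backing device read-only while Lustre is not running on it&lt;br /&gt;
 mount -o ro -t ext3 /dev/sdb1 /mnt/ost-debug&lt;br /&gt;
 # list a few of the object files; names such as 934151 are object numbers&lt;br /&gt;
 find /mnt/ost-debug -type f | head&lt;br /&gt;
 umount /mnt/ost-debug&lt;br /&gt;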
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
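&lt;br /&gt;
As a quick, hedged illustration (the path and sizes are arbitrary), direct I/O can be exercised from the shell with GNU dd, which opens the file with O_DIRECT:&lt;br /&gt;
&lt;br /&gt;
 # write 1 GB to a Lustre file, bypassing the client page cache&lt;br /&gt;
 dd if=/dev/zero of=/mnt/lustre/directio.dat bs=1M count=1024 oflag=direct&lt;br /&gt;
 # read it back with direct I/O as well&lt;br /&gt;
 dd if=/mnt/lustre/directio.dat of=/dev/null bs=1M iflag=direct&lt;br /&gt;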
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
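&lt;br /&gt;
Outside of MPI/IO, striping can also be chosen per file or per directory with the lfs utility. A hedged sketch (the option syntax has varied across Lustre versions, and the path is hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # stripe new files in this directory over 4 OSTs with a 1 MB stripe size&lt;br /&gt;
 lfs setstripe -s 1M -c 4 /mnt/lustre/output&lt;br /&gt;
 # show the resulting layout&lt;br /&gt;
 lfs getstripe /mnt/lustre/output&lt;br /&gt;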
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover handles the failure of an MDS or OSS node as a whole, which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash. If a client node hangs or crashes, the other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes and clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tools will reliably fix any damage they can. The repair can run in parallel on all nodes, but can still be very time-consuming for large file systems.&lt;br /&gt;
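&lt;br /&gt;
A hedged outline of such a repair (the device name is hypothetical; follow the manual for your release): with the affected server stopped, each backend ext3 file system is checked with e2fsck, after which the Lustre-level lfsck tool can reconcile the MDS and the OSTs:&lt;br /&gt;
&lt;br /&gt;
 # force a full check of a backend device and fix problems without prompting&lt;br /&gt;
 e2fsck -f -y /dev/md0&lt;br /&gt;
 # a journal-only recovery, by contrast, happens automatically at mount time&lt;br /&gt;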
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, servers running Linux 2.4 will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but they are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
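&lt;br /&gt;
A hedged sketch of a client-side update (the package names and mount syntax are illustrative and differ between releases):&lt;br /&gt;
&lt;br /&gt;
 # on a client: unmount, update the packages, and remount&lt;br /&gt;
 umount /mnt/lustre&lt;br /&gt;
 rpm -Uvh lustre-*.rpm&lt;br /&gt;
 # remount; the device syntax shown follows the newer mgsnode:/fsname form&lt;br /&gt;
 mount -t lustre mgs@tcp0:/testfs /mnt/lustre&lt;br /&gt;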
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. Developing some features only for your proprietary product would undermine that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4223</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4223"/>
		<updated>2008-01-28T04:41:50Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* How do I automate failover of my MDSs? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; under Fundamentals)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
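&lt;br /&gt;
For instance (a hedged example; the path is hypothetical), the object layout of any file can be inspected from a client:&lt;br /&gt;
&lt;br /&gt;
 # show which OSTs hold the objects that make up this file&lt;br /&gt;
 lfs getstripe /mnt/lustre/results/output.dat&lt;br /&gt;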
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
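&lt;br /&gt;
With the 1.6 utilities, adding an OST is roughly as follows (a hedged sketch; the device, file system name, and MGS node are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # on the new or existing OSS: format another OST for file system testfs&lt;br /&gt;
 mkfs.lustre --ost --fsname=testfs --mgsnode=mgs@tcp0 /dev/sdc1&lt;br /&gt;
 # mounting the target starts the OST and registers it with the file system&lt;br /&gt;
 mount -t lustre /dev/sdc1 /mnt/testfs/ost2&lt;br /&gt;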
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production file systems of 1.4 PB.&lt;br /&gt;
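&lt;br /&gt;
Spelled out, the 32 PB figure follows directly from the per-OST limit:&lt;br /&gt;
&lt;br /&gt;
 4,000 OSTs x 8 TB per OST = 32 PB of aggregate capacity&lt;br /&gt;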
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger block size matching PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64 bit clusters, the maximum file size is 2^64.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8TB per stripe, leading to about 1.28PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
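&lt;br /&gt;
As a hedged illustration of the kind of mkfs option involved (the Lustre formatting tools pass such options through to mke2fs; the device and ratio here are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # allocate one inode per 2 kB of MDS device space instead of the default ratio&lt;br /&gt;
 mke2fs -j -i 2048 /dev/sda1&lt;br /&gt;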
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
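&lt;br /&gt;
These client-side defaults can be inspected (and, with care, tuned) through /proc; a hedged example, since the exact paths differ between versions:&lt;br /&gt;
&lt;br /&gt;
 # per-OSC RPC settings on a client (paths are version-dependent)&lt;br /&gt;
 cat /proc/fs/lustre/osc/*/max_pages_per_rpc&lt;br /&gt;
 cat /proc/fs/lustre/osc/*/max_rpcs_in_flight&lt;br /&gt;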
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
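&lt;br /&gt;
A minimal, hedged smb.conf fragment for exporting a Lustre mount from a single Samba node might look like this (the share name and path are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 [lustre]&lt;br /&gt;
    path = /mnt/lustre&lt;br /&gt;
    read only = no&lt;br /&gt;
    browseable = yes&lt;br /&gt;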
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage devices: fibrechannel, SCSI, SATA, ATA, and exotic storage (NVRAM).&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
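&lt;br /&gt;
A hedged example with hypothetical devices: ext3 supports an external journal device, which the backend file systems can be formatted to use:&lt;br /&gt;
&lt;br /&gt;
 # create a dedicated journal device, then an ext3 file system that uses it&lt;br /&gt;
 mke2fs -O journal_dev /dev/sdb1&lt;br /&gt;
 mke2fs -j -J device=/dev/sdb1 /dev/sdc1&lt;br /&gt;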
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
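&lt;br /&gt;
For example (hedged; the device names and sizes are hypothetical), a software RAID-5 set or an LVM logical volume can be prepared with the usual tools and then used as the backend device:&lt;br /&gt;
&lt;br /&gt;
 # software RAID-5 across four disks&lt;br /&gt;
 mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde&lt;br /&gt;
 # or an LVM logical volume&lt;br /&gt;
 pvcreate /dev/sdb&lt;br /&gt;
 vgcreate lustrevg /dev/sdb&lt;br /&gt;
 lvcreate -L 500G -n ost0 lustrevg&lt;br /&gt;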
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
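&lt;br /&gt;
Steps 3 and 4 above might look like the following hedged sketch (host names, devices, and the file system name are hypothetical; newer mount-based commands are shown, while older releases use the lmc/lconf tools instead):&lt;br /&gt;
&lt;br /&gt;
 # step 3: the same format-and-start commands can be pushed to every OSS with pdsh&lt;br /&gt;
 pdsh -w oss[01-10] mkfs.lustre --ost --fsname=testfs --mgsnode=mds1@tcp0 /dev/sdb1&lt;br /&gt;
 pdsh -w oss[01-10] mount -t lustre /dev/sdb1 /mnt/testfs/ost&lt;br /&gt;
 # step 4: clients mount the file system much like an NFS mount&lt;br /&gt;
 pdsh -w client[001-100] mount -t lustre mds1@tcp0:/testfs /mnt/lustre&lt;br /&gt;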
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that the node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and Seastar networks.&lt;br /&gt;
&lt;br /&gt;
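On releases with the LNET networking layer, the interconnects a node uses are selected with a module parameter; a hedged example (the interface name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # /etc/modprobe.conf: use TCP on eth0 and Elan on the same node&lt;br /&gt;
 options lnet networks=&amp;quot;tcp0(eth0),elan0&amp;quot;&lt;br /&gt;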
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
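&lt;br /&gt;
As an illustration only, the sketch below shows the general idea of hashing a name to pick one of several metadata servers; the actual hash function and server layout used by Lustre are internal details:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch: route a directory entry to one of N metadata servers.&lt;br /&gt;
 # The hash function and server list are illustrative, not Lustre&#039;s own.&lt;br /&gt;
 import zlib&lt;br /&gt;
 mds_servers = [0, 1, 2, 3]              # four MDS nodes striping one directory&lt;br /&gt;
 def mds_for_name(name, servers):&lt;br /&gt;
     # a stable hash of the name selects the server holding its entry&lt;br /&gt;
     h = zlib.crc32(name.encode())&lt;br /&gt;
     return servers[h % len(servers)]&lt;br /&gt;
 # every client computes the same answer, so no broadcast lookup is needed&lt;br /&gt;
 target = mds_for_name(&#039;checkpoint.0001&#039;, mds_servers)&lt;br /&gt;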
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Recovery Recovery].&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
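&lt;br /&gt;
A minimal sketch of that decision, with invented names; Lustre 1.x always takes the execute-immediately branch, since clients have no metadata writeback cache:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch of an intent-based metadata request.&lt;br /&gt;
 contended_dirs = {&#039;/scratch/output&#039;}     # directories under heavy shared use&lt;br /&gt;
 def handle_lock_with_intent(parent_dir, operation):&lt;br /&gt;
     if parent_dir in contended_dirs:&lt;br /&gt;
         result = operation()             # server executes the bundled operation now&lt;br /&gt;
         return (&#039;result&#039;, result)        # only a result code goes back to the client&lt;br /&gt;
     return (&#039;lock&#039;, parent_dir)          # otherwise grant a lock for writeback caching&lt;br /&gt;
 # a create bundled into the lock request completes in a single round trip&lt;br /&gt;
 reply = handle_lock_with_intent(&#039;/scratch/output&#039;, lambda: 0)&lt;br /&gt;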
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
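&lt;br /&gt;
A toy illustration of why splitting the lock helps; the part names below are invented and do not match Lustre&#039;s internal lock bits:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch: locking parts of an inode separately avoids the ping-pong.&lt;br /&gt;
 client_locks = {}                        # inode number to the set of parts a client holds&lt;br /&gt;
 def client_lookup(inode):&lt;br /&gt;
     client_locks.setdefault(inode, set()).add(&#039;lookup&#039;)&lt;br /&gt;
 def server_create_child(inode):&lt;br /&gt;
     # creating a child only touches the directory contents, so the&lt;br /&gt;
     # client&#039;s cached &#039;lookup&#039; part need not be revoked&lt;br /&gt;
     held = client_locks.get(inode, set())&lt;br /&gt;
     return &#039;contents&#039; not in held&lt;br /&gt;
 client_lookup(42)&lt;br /&gt;
 can_create_without_revocation = server_create_child(42)&lt;br /&gt;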
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
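&lt;br /&gt;
A minimal sketch of the idea, with invented names and an arbitrary pool size:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch of MDS-side object preallocation, one pool per OST.&lt;br /&gt;
 precreated = {0: list(range(100, 132)), 1: list(range(200, 232))}&lt;br /&gt;
 def allocate_stripe(ost_index):&lt;br /&gt;
     obj = precreated[ost_index].pop()    # no OST round trip on the create path&lt;br /&gt;
     if len(precreated[ost_index]) == 0:&lt;br /&gt;
         pass                             # in Lustre the pool is replenished asynchronously&lt;br /&gt;
     return obj&lt;br /&gt;
 stripe_object = allocate_stripe(0)&lt;br /&gt;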
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound and not dominated by extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. These are the RAID-1 and RAID-5 file I/O features on the roadmap. They will provide redundancy and recoverability in the Lustre object protocol itself, rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
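&lt;br /&gt;
A much-simplified sketch of that sequence; the timeout value and data structures are illustrative only:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch of lock revocation on a client holding a conflicting lock.&lt;br /&gt;
 import time&lt;br /&gt;
 CALLBACK_TIMEOUT = 100.0                 # seconds; illustrative, configurable in Lustre&lt;br /&gt;
 def revoke(holder, lock):&lt;br /&gt;
     deadline = time.time() + CALLBACK_TIMEOUT&lt;br /&gt;
     holder[&#039;dirty&#039;].clear()              # 1. write back cached modifications&lt;br /&gt;
     holder[&#039;cached&#039;].discard(lock)       # 2. drop data covered by the lock&lt;br /&gt;
     holder[&#039;locks&#039;].discard(lock)        # 3. only then give the lock back&lt;br /&gt;
     return time.time() &lt;= deadline       # False would mean eviction from the OST&lt;br /&gt;
 client = {&#039;dirty&#039;: [], &#039;cached&#039;: {7}, &#039;locks&#039;: {7}}&lt;br /&gt;
 still_a_member = revoke(client, 7)&lt;br /&gt;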
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking . If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
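&lt;br /&gt;
A rough sketch of that optimization, with invented names:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch of the glimpse path used for stat on a busy file.&lt;br /&gt;
 def glimpse_or_lock(object_is_being_written, current_size):&lt;br /&gt;
     if object_is_being_written:&lt;br /&gt;
         return (&#039;attributes&#039;, current_size)   # answer with the size, grant no lock&lt;br /&gt;
     return (&#039;lock&#039;, current_size)             # idle file: the client may cache attributes&lt;br /&gt;
 # an &#039;ls -l&#039; while a job is writing still returns an up-to-date size&lt;br /&gt;
 reply = glimpse_or_lock(True, 4 * 1024 * 1024)&lt;br /&gt;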
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, a node is more likely to hang or time out than to crash outright. If a client node hangs or crashes, the other client and server nodes are usually not affected; normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tools will reliably fix any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported; beginning with Lustre 1.8, clients with 2.4 kernels will not be supported either.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4222</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4222"/>
		<updated>2008-01-28T04:40:21Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* What is the typical MDS node configuration? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
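&lt;br /&gt;
As a rough illustration of RAID-0 placement, the sketch below maps a file offset to an object and an offset within that object; the stripe size and count are arbitrary example values, not defaults:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical sketch of RAID-0 striping across objects.&lt;br /&gt;
 stripe_size  = 1024 * 1024               # 1 MB stripe unit (example value)&lt;br /&gt;
 stripe_count = 4                         # file spread over four objects (example value)&lt;br /&gt;
 def locate(file_offset):&lt;br /&gt;
     stripe_index = file_offset // stripe_size&lt;br /&gt;
     obj          = stripe_index % stripe_count             # which object holds this byte&lt;br /&gt;
     obj_offset   = (stripe_index // stripe_count) * stripe_size + file_offset % stripe_size&lt;br /&gt;
     return obj, obj_offset&lt;br /&gt;
 # byte 5 MB into the file lands 1 MB into the second object&lt;br /&gt;
 where = locate(5 * 1024 * 1024)&lt;br /&gt;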
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB each.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyways, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
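&lt;br /&gt;
A back-of-the-envelope check of the 32 PB figure above:&lt;br /&gt;
&lt;br /&gt;
 # Aggregate capacity from the per-OST limit and OST count quoted above.&lt;br /&gt;
 max_ost_tb = 8                           # ext3 limit per OST on Linux 2.6&lt;br /&gt;
 ost_count  = 4000&lt;br /&gt;
 total_pb   = max_ost_tb * ost_count / 1000    # 32 PB&lt;br /&gt;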
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^64 bytes. The current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
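&lt;br /&gt;
The allocated-space limit quoted above follows directly from its two factors:&lt;br /&gt;
&lt;br /&gt;
 # Per-file limit from the maximum stripe count and per-stripe (object) size.&lt;br /&gt;
 max_stripes = 160&lt;br /&gt;
 stripe_tb   = 8                          # about 8 TB per stripe&lt;br /&gt;
 max_file_pb = max_stripes * stripe_tb / 1000   # about 1.28 PB per file&lt;br /&gt;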
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
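&lt;br /&gt;
The default inode count follows from the device-size-over-4kB rule of thumb above:&lt;br /&gt;
&lt;br /&gt;
 # Default MDS inode count for a 2 TB MDS file system (binary units assumed).&lt;br /&gt;
 mds_device_bytes = 2 * 2**40&lt;br /&gt;
 bytes_per_inode  = 4096                  # the default mkfs ratio quoted above&lt;br /&gt;
 default_inodes   = mds_device_bytes // bytes_per_inode   # 536870912, about 512 million&lt;br /&gt;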
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
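&lt;br /&gt;
The two example configurations above reduce to the same simple arithmetic:&lt;br /&gt;
&lt;br /&gt;
 # Aggregate capacity (TB) and bandwidth (GB/s) for an OSS pool.&lt;br /&gt;
 def aggregate(oss_count, tb_per_oss, gbs_per_oss):&lt;br /&gt;
     return oss_count * tb_per_oss, oss_count * gbs_per_oss&lt;br /&gt;
 many_small = aggregate(100, 1, 0.1)      # (100 TB, 10 GB/s) from gig-e connected nodes&lt;br /&gt;
 few_large  = aggregate(4, 25, 2.5)       # (100 TB, 10 GB/s) from heavy-duty OSS nodes&lt;br /&gt;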
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
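&lt;br /&gt;
For example, under the 4 kB-per-file rule of thumb above, an MDS expected to hold the 20 million files mentioned earlier needs roughly:&lt;br /&gt;
&lt;br /&gt;
 # Rough MDS storage estimate from the per-file rule of thumb.&lt;br /&gt;
 expected_files = 20 * 10**6&lt;br /&gt;
 mds_bytes      = expected_files * 4096   # about 80 GB of MDS storage&lt;br /&gt;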
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
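&lt;br /&gt;
As a rough sketch of steps 3 and 4 (hostnames, paths, and the local start script are hypothetical, and the exact server-side format/start command depends on the Lustre version and your configuration; the client mount shown uses 1.6-style syntax, and 1.4.x uses a different device string):&lt;br /&gt;
 # run the same format/start script on all object servers in parallel&lt;br /&gt;
 pdsh -w oss[1-8] &#039;sh /root/format_and_start_oss.sh&#039;&lt;br /&gt;
 # on each client, mount the file system much like an NFS mount&lt;br /&gt;
 mount -t lustre mds1@tcp0:/testfs /mnt/lustre&lt;br /&gt;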
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM) (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
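&lt;br /&gt;
As a hedged sketch covering both of the preceding questions (interface and network names are hypothetical, and the exact LND names and module-option syntax vary by release), LNET-based releases (1.4.6 and later) select interfaces with a module option such as:&lt;br /&gt;
 # /etc/modprobe.conf: use two ethernet ports plus one InfiniBand port&lt;br /&gt;
 options lnet networks=&amp;quot;tcp0(eth0,eth1),o2ib0(ib0)&amp;quot;&lt;br /&gt;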
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented gateway nodes to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers to work in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#Installation Installation].&lt;br /&gt;
&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
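&lt;br /&gt;
For illustration (a generic e2fsprogs invocation with a hypothetical device name, not the exact Lustre formatting command), large inodes are requested at format time with the inode-size option:&lt;br /&gt;
 # format with 256-byte inodes so the striping EA usually fits in the inode&lt;br /&gt;
 mke2fs -j -I 256 /dev/sda1&lt;br /&gt;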
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s, 1 GB/s on Woodcrest &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
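&lt;br /&gt;
As noted above, the backend can be inspected directly; a cautious sketch (hypothetical device name, and the OST should not be in use by Lustre at the time):&lt;br /&gt;
 # mount the OST&#039;s ext3 file system read-only and list the object files&lt;br /&gt;
 mount -t ext3 -o ro /dev/sdc1 /mnt/ost0&lt;br /&gt;
 ls /mnt/ost0&lt;br /&gt;
 umount /mnt/ost0&lt;br /&gt;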
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
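&lt;br /&gt;
For example (standard Linux tools with hypothetical paths), direct I/O can be exercised from the shell with dd:&lt;br /&gt;
 # bypass the client page cache for both the write and the read&lt;br /&gt;
 dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=128 oflag=direct&lt;br /&gt;
 dd if=/mnt/lustre/testfile of=/dev/null bs=1M iflag=direct&lt;br /&gt;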
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
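&lt;br /&gt;
From the command line, the equivalent hint is given with the lfs utility; a sketch (the file name is hypothetical, and the exact option letters vary between Lustre versions):&lt;br /&gt;
 # stripe the output file over 4 OSTs with a 1 MB stripe size&lt;br /&gt;
 lfs setstripe -s 1M -c 4 /mnt/lustre/output/checkpoint.dat&lt;br /&gt;
 lfs getstripe /mnt/lustre/output/checkpoint.dat&lt;br /&gt;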
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
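&lt;br /&gt;
The timeout mentioned above is a cluster-wide tunable; as a hedged example (the proc path shown is from the 1.4/1.6 era and may differ in other releases):&lt;br /&gt;
 # view or raise the Lustre RPC timeout (in seconds) on a node&lt;br /&gt;
 cat /proc/sys/lustre/timeout&lt;br /&gt;
 echo 100 &gt; /proc/sys/lustre/timeout&lt;br /&gt;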
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients running 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is a customer demand for a given kernel series, Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4221</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4221"/>
		<updated>2008-01-28T04:37:35Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Which operating systems are supported as clients and servers? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce consistent usage.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
It is common for a single OSS to export more than one OST, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OSTs, either on a new or on an existing OSS. In an upcoming version of Lustre, the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems. Shrinking is not supported.&lt;br /&gt;
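&lt;br /&gt;
As a generic illustration of the underlying ext3 mechanism (hypothetical volume name; this is not a Lustre-specific procedure and should only be attempted on a release that supports it), an online grow of a backend volume looks like:&lt;br /&gt;
 # grow the LVM volume, then resize the mounted ext3 file system online&lt;br /&gt;
 lvextend -L +200G /dev/vg_lustre/ost0&lt;br /&gt;
 resize2fs /dev/vg_lustre/ost0&lt;br /&gt;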
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today. &lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clients, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
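&lt;br /&gt;
For illustration (generic e2fsprogs options with a hypothetical device name; the Lustre formatting tools pass similar options through), the inode count can be raised at format time:&lt;br /&gt;
 # allocate one inode per 2 kB of device space instead of the default&lt;br /&gt;
 mke2fs -j -i 2048 /dev/sda1&lt;br /&gt;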
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
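&lt;br /&gt;
The client-side RPC concurrency described above can be observed and tuned per OSC; as a hedged example (proc paths are from the 1.4/1.6 era, and the wildcard expands to the actual target names):&lt;br /&gt;
 # show and adjust the number of concurrent RPCs per server connection&lt;br /&gt;
 cat /proc/fs/lustre/osc/*/max_rpcs_in_flight&lt;br /&gt;
 for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do echo 8 &gt; $f; done&lt;br /&gt;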
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see [http://wiki.lustre.org/index.php?title=Lustre_FAQ#OS_Support OS Support].&lt;br /&gt;
&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
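&lt;br /&gt;
A minimal sketch of such an export (the share name and path are hypothetical; this is standard Samba configuration, nothing Lustre-specific), added to /etc/samba/smb.conf on the exporting client:&lt;br /&gt;
 # share the Lustre mount point over CIFS&lt;br /&gt;
 [lustre]&lt;br /&gt;
     path = /mnt/lustre&lt;br /&gt;
     read only = no&lt;br /&gt;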
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage can be recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
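&lt;br /&gt;
As a minimal sketch of step 3 (the hostnames and the command string below are placeholders, not real Lustre invocations), the same command can be fanned out to every server node in parallel:&lt;br /&gt;
&lt;br /&gt;
 # run one identical command on every server node in parallel (pdsh-style)&lt;br /&gt;
 import subprocess&lt;br /&gt;
 nodes = [&amp;quot;oss01&amp;quot;, &amp;quot;oss02&amp;quot;, &amp;quot;oss03&amp;quot;, &amp;quot;mds01&amp;quot;]   # hypothetical hostnames&lt;br /&gt;
 command = &amp;quot;echo format-and-start-lustre-services-here&amp;quot;   # placeholder command&lt;br /&gt;
 procs = [subprocess.Popen([&amp;quot;ssh&amp;quot;, node, command]) for node in nodes]&lt;br /&gt;
 for p in procs:&lt;br /&gt;
     p.wait()   # all nodes execute the same command concurrently&lt;br /&gt;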
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
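&lt;br /&gt;
A toy sketch of that routing decision (this is not Lustre or Portals code; the network names, gateway name, and host@network notation are invented for illustration):&lt;br /&gt;
&lt;br /&gt;
 # pick a next hop for a message whose destination is on another network&lt;br /&gt;
 local_networks = {&amp;quot;tcp0&amp;quot;}                    # networks this node is attached to&lt;br /&gt;
 route_table = {&amp;quot;elan0&amp;quot;: &amp;quot;gw01@tcp0&amp;quot;}         # remote network mapped to a locally reachable gateway&lt;br /&gt;
 def next_hop(dest):&lt;br /&gt;
     host, network = dest.split(&amp;quot;@&amp;quot;)&lt;br /&gt;
     if network in local_networks:&lt;br /&gt;
         return dest                            # deliver directly&lt;br /&gt;
     return route_table[network]                # forward via the gateway node&lt;br /&gt;
 print(next_hop(&amp;quot;client12@elan0&amp;quot;))             # prints gw01@tcp0&lt;br /&gt;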
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
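&lt;br /&gt;
A small illustration of the idea (the hash function and server count here are arbitrary, not the algorithm Lustre actually uses):&lt;br /&gt;
&lt;br /&gt;
 # choose which metadata server handles a given name in a striped directory&lt;br /&gt;
 import zlib&lt;br /&gt;
 NUM_MDS = 4                                   # hypothetical MDS count for this directory&lt;br /&gt;
 def mds_for_name(name):&lt;br /&gt;
     return zlib.crc32(name.encode()) % NUM_MDS&lt;br /&gt;
 for name in (&amp;quot;alpha.dat&amp;quot;, &amp;quot;beta.dat&amp;quot;, &amp;quot;gamma.dat&amp;quot;):&lt;br /&gt;
     print(name, &amp;quot;is handled by MDS&amp;quot;, mds_for_name(name))&lt;br /&gt;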
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
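&lt;br /&gt;
As a rough illustration of why very wide striping spills out of the inode (the header and per-stripe byte counts below are invented for the example, not on-disk formats):&lt;br /&gt;
&lt;br /&gt;
 INODE_SIZE = 256            # bytes, as formatted on the MDS (from the text above)&lt;br /&gt;
 NON_EA_SPACE = 128          # assumed space used by ordinary inode fields (hypothetical)&lt;br /&gt;
 EA_HEADER = 32              # assumed striping-EA header size (hypothetical)&lt;br /&gt;
 PER_STRIPE = 24             # assumed bytes per stripe entry (hypothetical)&lt;br /&gt;
 def ea_fits_in_inode(stripe_count):&lt;br /&gt;
     ea_size = EA_HEADER + PER_STRIPE * stripe_count&lt;br /&gt;
     return INODE_SIZE - NON_EA_SPACE - ea_size &gt;= 0&lt;br /&gt;
 print(ea_fits_in_inode(4))      # narrow striping fits inside the inode&lt;br /&gt;
 print(ea_fits_in_inode(32))     # very wide striping is stored in separate blocks&lt;br /&gt;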
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
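&lt;br /&gt;
A rough RPC-count comparison for the shared-directory example above (illustrative arithmetic, not a measured protocol trace):&lt;br /&gt;
&lt;br /&gt;
 clients = 1000&lt;br /&gt;
 # client-driven updates: lock, fetch, write back the change, release&lt;br /&gt;
 rpcs_lock_based = clients * 4&lt;br /&gt;
 # intent-based: the lock request carries the whole operation, so one RPC each&lt;br /&gt;
 rpcs_intent = clients * 1&lt;br /&gt;
 print(rpcs_lock_based, &amp;quot;RPCs versus&amp;quot;, rpcs_intent, &amp;quot;RPCs&amp;quot;)&lt;br /&gt;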
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
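&lt;br /&gt;
A toy model of that behaviour (the names, batch size, and low-water mark are invented for illustration; in Lustre the refill is an asynchronous request to the OST):&lt;br /&gt;
&lt;br /&gt;
 class OstPool:&lt;br /&gt;
     def __init__(self, batch=32, low_water=8):&lt;br /&gt;
         self.batch, self.low_water = batch, low_water&lt;br /&gt;
         self.next_id, self.pool = 0, []&lt;br /&gt;
         self.refill()&lt;br /&gt;
     def refill(self):                          # pre-create another batch of objects&lt;br /&gt;
         self.pool += list(range(self.next_id, self.next_id + self.batch))&lt;br /&gt;
         self.next_id += self.batch&lt;br /&gt;
     def allocate(self):                        # MDS assigns an object as a file stripe&lt;br /&gt;
         obj = self.pool.pop(0)&lt;br /&gt;
         if self.low_water &gt;= len(self.pool):&lt;br /&gt;
             self.refill()                      # replenish before the pool runs dry&lt;br /&gt;
         return obj&lt;br /&gt;
 pool = OstPool()&lt;br /&gt;
 print([pool.allocate() for _ in range(5)])     # no extra RPC needed per create&lt;br /&gt;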
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, and not extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s; 1 GB/s on Woodcrest &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
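&lt;br /&gt;
For reference, the RAID-0 striping arithmetic that decides where a given file offset lives (the stripe size and count below are example values only):&lt;br /&gt;
&lt;br /&gt;
 STRIPE_SIZE = 1024 * 1024        # 1 MB stripes (example value)&lt;br /&gt;
 STRIPE_COUNT = 4                 # file striped across 4 OSTs (example value)&lt;br /&gt;
 def locate(file_offset):&lt;br /&gt;
     stripe_number = file_offset // STRIPE_SIZE&lt;br /&gt;
     stripe_index = stripe_number % STRIPE_COUNT               # which object/OST&lt;br /&gt;
     object_offset = (stripe_number // STRIPE_COUNT) * STRIPE_SIZE + file_offset % STRIPE_SIZE&lt;br /&gt;
     return stripe_index, object_offset&lt;br /&gt;
 print(locate(0))                          # (0, 0): start of the file, first object&lt;br /&gt;
 print(locate(5 * 1024 * 1024 + 10))       # sixth megabyte lands in object 1&lt;br /&gt;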
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
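&lt;br /&gt;
A much-simplified sketch of that exchange (the class and method names are invented; the point is only the ordering: flush, purge, then release):&lt;br /&gt;
&lt;br /&gt;
 class Client:&lt;br /&gt;
     def __init__(self):&lt;br /&gt;
         self.dirty, self.cached = [], []&lt;br /&gt;
     def blocking_callback(self, server):       # server asks us to drop the lock&lt;br /&gt;
         server.receive_writes(self.dirty)      # 1. write back cached modifications&lt;br /&gt;
         self.dirty, self.cached = [], []       # 2. drop data the lock covered&lt;br /&gt;
         server.lock_released(self)             # 3. only now release the lock&lt;br /&gt;
 class OstLockServer:&lt;br /&gt;
     def __init__(self):&lt;br /&gt;
         self.holder = None&lt;br /&gt;
     def receive_writes(self, pages):&lt;br /&gt;
         pass                                    # data is now safe on the OST&lt;br /&gt;
     def lock_released(self, client):&lt;br /&gt;
         self.holder = None&lt;br /&gt;
     def request_lock(self, client):&lt;br /&gt;
         if self.holder not in (None, client):&lt;br /&gt;
             self.holder.blocking_callback(self) # conflicting request: recall the lock&lt;br /&gt;
         self.holder = client&lt;br /&gt;
 server = OstLockServer()&lt;br /&gt;
 writer, reader = Client(), Client()&lt;br /&gt;
 server.request_lock(writer); writer.dirty.append(1)&lt;br /&gt;
 server.request_lock(reader)                     # triggers the callback to the writer&lt;br /&gt;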
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming rarer, it is more likely that a node will hang or time out than crash. If a client node hangs or crashes, other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. Repair can run in parallel on all nodes, but it can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support four kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but they are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4220</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4220"/>
		<updated>2008-01-28T04:32:12Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Does Lustre use/provide a single security domain? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See &amp;quot;What is the difference between an OST and an OSS?&amp;quot; below)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today. &lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a larger PAGE_SIZE (on IA-64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clusters, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
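&lt;br /&gt;
Spelled out (the per-stripe figure is the 8 TB per-object limit discussed above):&lt;br /&gt;
&lt;br /&gt;
 MAX_STRIPES = 160&lt;br /&gt;
 TB_PER_STRIPE = 8                 # per-object limit imposed by ext3 on Linux 2.6&lt;br /&gt;
 print(MAX_STRIPES * TB_PER_STRIPE, &amp;quot;TB of allocatable space per file, i.e. about 1.28 PB&amp;quot;)&lt;br /&gt;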
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
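&lt;br /&gt;
A small sizing helper in the same spirit (illustrative only; substitute your own per-OSS capacity and bandwidth figures):&lt;br /&gt;
&lt;br /&gt;
 import math&lt;br /&gt;
 target_capacity_tb = 100&lt;br /&gt;
 target_bandwidth_mbs = 10_000                 # 10 GB/s aggregate&lt;br /&gt;
 per_oss_capacity_tb = 1                       # e.g. 1 TB behind each gigE OSS&lt;br /&gt;
 per_oss_bandwidth_mbs = 100                   # roughly 100 MB/s per gigE OSS&lt;br /&gt;
 needed = max(math.ceil(target_capacity_tb / per_oss_capacity_tb),&lt;br /&gt;
              math.ceil(target_bandwidth_mbs / per_oss_bandwidth_mbs))&lt;br /&gt;
 print(needed, &amp;quot;OSS nodes&amp;quot;)                    # 100, matching the example above&lt;br /&gt;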
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
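&lt;br /&gt;
The pipeline arithmetic behind those numbers, with an assumed round-trip-plus-service time (the 10 ms figure is purely illustrative):&lt;br /&gt;
&lt;br /&gt;
 rpc_size_mb = 1.0             # data per RPC on the wire&lt;br /&gt;
 rpcs_in_flight = 8            # between the 5 and 10 mentioned above&lt;br /&gt;
 round_trip_s = 0.010          # assumed time to send, service, and acknowledge one RPC&lt;br /&gt;
 print(rpcs_in_flight * rpc_size_mb / round_trip_s, &amp;quot;MB/s upper bound per server, with these assumptions&amp;quot;)&lt;br /&gt;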
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see OS Support.&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
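&lt;br /&gt;
As a minimal sketch of such an export node -- assuming the Lustre client is already mounted at /mnt/lustre, and with the share name chosen purely as an example -- the relevant smb.conf stanza is simply:&lt;br /&gt;
&lt;br /&gt;
 # /etc/samba/smb.conf on the export node&lt;br /&gt;
 [lustre]&lt;br /&gt;
     path = /mnt/lustre&lt;br /&gt;
     writeable = yes&lt;br /&gt;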
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, hence RAID-1 mirrored storage is recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
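&lt;br /&gt;
As a worked example, using the 20-million-file figure above:&lt;br /&gt;
&lt;br /&gt;
 20,000,000 files x 4 kB/file = 80 GB of MDS storage&lt;br /&gt;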
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems of different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
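&lt;br /&gt;
For a plain ext3 backend this is the standard external-journal procedure; a minimal sketch (device names are placeholders, and with Lustre the same mke2fs options are normally passed through the Lustre formatting or configuration tools rather than run by hand):&lt;br /&gt;
&lt;br /&gt;
 # create a dedicated journal device, then a file system that uses it&lt;br /&gt;
 mke2fs -O journal_dev /dev/sdc1&lt;br /&gt;
 mke2fs -j -J device=/dev/sdc1 /dev/sdb1&lt;br /&gt;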
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
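&lt;br /&gt;
For example, a backing device can be assembled with the ordinary Linux tools before it is handed to Lustre (a sketch only; device and volume names are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # software RAID-5 across four disks&lt;br /&gt;
 mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1&lt;br /&gt;
 # or carve a logical volume out of it with LVM&lt;br /&gt;
 pvcreate /dev/md0&lt;br /&gt;
 vgcreate ostvg /dev/md0&lt;br /&gt;
 lvcreate -L 500G -n ost0 ostvg&lt;br /&gt;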
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted (see the sketch below). &lt;br /&gt;
&lt;br /&gt;
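A minimal sketch of steps 3 and 4 (Lustre 1.6-style commands are shown; 1.4.x instead uses configuration files generated with lmc and applied with lconf, and all host names, addresses, and devices here are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # format and start one object server&lt;br /&gt;
 mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/sdb&lt;br /&gt;
 mount -t lustre /dev/sdb /mnt/testfs-ost0&lt;br /&gt;
 # start the clients in parallel, similar to an NFS mount&lt;br /&gt;
 pdsh -w client[1-64] mount -t lustre 192.168.0.10@tcp0:/testfs /mnt/lustre&lt;br /&gt;
&lt;br /&gt;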
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
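&lt;br /&gt;
The interconnect(s) a given node should use are selected with the LNET module options (Lustre 1.4.6 and later); a minimal sketch, with interface names as placeholders and the network type chosen to match whichever LND your fabric uses (tcp, elan, openib, o2ib, gm, ...):&lt;br /&gt;
&lt;br /&gt;
 # /etc/modprobe.conf -- use TCP on eth0 and OFED InfiniBand on ib0&lt;br /&gt;
 options lnet networks=tcp0(eth0),o2ib0(ib0)&lt;br /&gt;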
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
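&lt;br /&gt;
As a hedged sketch of such a routed configuration using LNET module options (Lustre 1.4.6 and later; network names and addresses are placeholders): the gateway node simply lists both networks in its own networks= setting, and an ethernet-only client is told how to reach the far network:&lt;br /&gt;
&lt;br /&gt;
 # on a gig-e only client: reach the elan network via the gateway at 192.168.1.10&lt;br /&gt;
 options lnet networks=tcp0(eth0) routes=&amp;quot;elan 192.168.1.10@tcp0&amp;quot;&lt;br /&gt;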
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
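&lt;br /&gt;
The striping EA of any file can be examined from a client with the lfs utility; for example (the path is illustrative):&lt;br /&gt;
&lt;br /&gt;
 # show which OSTs hold the objects of a given file&lt;br /&gt;
 lfs getstripe /mnt/lustre/output/data.0&lt;br /&gt;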
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor composed of extremely small requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s, 1 GB/s on Woodcrest &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
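&lt;br /&gt;
A minimal sketch of such an inspection, assuming the OST is stopped and its backend device is /dev/sdb1 (mount it read-only, and look but do not touch):&lt;br /&gt;
&lt;br /&gt;
 mount -r -t ext3 /dev/sdb1 /mnt/inspect&lt;br /&gt;
 ls /mnt/inspect&lt;br /&gt;
 umount /mnt/inspect&lt;br /&gt;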
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
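&lt;br /&gt;
For example, direct I/O can be exercised from the shell with dd (path and sizes are illustrative; direct I/O requests must be suitably aligned):&lt;br /&gt;
&lt;br /&gt;
 # write 1 GB to Lustre, bypassing the client page cache&lt;br /&gt;
 dd if=/dev/zero of=/mnt/lustre/dio_test bs=1M count=1024 oflag=direct&lt;br /&gt;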
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
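&lt;br /&gt;
Until then, the same effect is available outside of MPI/IO by striping the output directory ahead of time with the lfs utility; a sketch (the flag-style syntax shown here is from newer releases -- 1.4.x uses positional arguments -- and the path is illustrative):&lt;br /&gt;
&lt;br /&gt;
 # new files created in this directory will be striped over 8 OSTs with a 4 MB stripe size&lt;br /&gt;
 lfs setstripe -s 4M -c 8 /mnt/lustre/output&lt;br /&gt;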
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, a node is more likely to hang or time out than to crash outright. If a client node hangs or crashes, other client and server nodes are usually not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s repair tool will reliably fix any damage it can. It runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported; beginning with 1.8, clients with 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4219</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4219"/>
		<updated>2008-01-28T04:30:55Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3 )&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes -- for example, to get around the Linux 2.6 maximum partition size of 8 TB. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the roadmap ).&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap]).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
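&lt;br /&gt;
With Lustre 1.6, adding an OST to a running file system is simply a matter of formatting and mounting the new target; a sketch (file system name, MGS address, and device are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # on the new or existing OSS&lt;br /&gt;
 mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/sdc&lt;br /&gt;
 mount -t lustre /dev/sdc /mnt/testfs-ost5&lt;br /&gt;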
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs. Running with almost 4,000 OSTs has been tried -- hence 32 PB file systems can be achieved today.&lt;br /&gt;
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB. On 64-bit clusters, the maximum file size is 2^64 bytes. A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
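&lt;br /&gt;
For example, the inode count is governed by the bytes-per-inode ratio handed to the backend mke2fs at format time; a sketch (Lustre 1.6 syntax, with names and devices as placeholders -- older releases pass the same option through their own configuration tools):&lt;br /&gt;
&lt;br /&gt;
 # one inode per 2 kB of MDS device space, doubling the default inode count&lt;br /&gt;
 mkfs.lustre --fsname=testfs --mdt --mgs --mkfsoptions=&#039;-i 2048&#039; /dev/sdb&lt;br /&gt;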
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure this can be 100 MB or more, however, so it has not been an issue in reality. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
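As a rough illustration of what those defaults mean for a single client (the server count of 16 is an arbitrary assumption for the example):&lt;br /&gt;
&lt;pre&gt;
# Approximate data a client keeps in flight, per the defaults described above.
rpc_size_mb = 1           # typical on-the-wire request size
rpcs_in_flight = 8        # somewhere between the 5 and 10 quoted above
num_servers = 16          # assumed number of OSTs the client is writing to
print(rpc_size_mb * rpcs_in_flight * num_servers)   # about 128 MB outstanding
&lt;/pre&gt;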
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see OS Support.&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, write I/O is low, but seek latency is very important, so RAID-1 mirrored storage is recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or equal to the page size on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic storage such as NVRAM.&lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), Myrinet GM (2.1.22+), Cisco, and Cray&#039;s Rapid Array and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to look up or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
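A toy sketch of the idea follows; the hash function and the mapping here are placeholders for illustration, not the actual Lustre algorithm:&lt;br /&gt;
&lt;pre&gt;
# Toy model of a directory striped over several metadata servers: each name
# maps deterministically to one MDS, so any client can find it without
# consulting a central coordinator.
import zlib

def mds_for_name(name, num_mds):
    # crc32 is a stand-in for whatever hash the real implementation uses
    return zlib.crc32(name.encode()) % num_mds

# With 4 MDSs, lookups and creates for a given name always go to the same server.
print(mds_for_name(&#039;checkpoint.0042&#039;, 4))
&lt;/pre&gt;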
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in RAM, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in RAM (until RAM is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
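A very rough sketch of the decision the server makes, with invented names purely for illustration (the real Lustre RPC structures look nothing like this):&lt;br /&gt;
&lt;pre&gt;
# The client bundles the whole operation (the intent) with its lock request;
# the server either executes it immediately or grants a lock for client-side
# writeback caching.  Either way, a single RPC resolves the request.
def handle_lock_request(intent, directory_is_contended):
    if directory_is_contended:
        # execute the bundled operation on the server, return only a result
        return {&#039;op&#039;: intent[&#039;op&#039;], &#039;executed_on_server&#039;: True}
    # otherwise hand back a lock so the client could cache the update locally
    return {&#039;op&#039;: intent[&#039;op&#039;], &#039;lock_granted_to_client&#039;: True}

print(handle_lock_request({&#039;op&#039;: &#039;create&#039;, &#039;name&#039;: &#039;file&#039;}, True))
&lt;/pre&gt;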
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first look up &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
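A minimal sketch of the pre-creation idea, with invented names and counts (the real protocol replenishes asynchronously and per OST):&lt;br /&gt;
&lt;pre&gt;
# The MDS keeps a pool of objects the OST has already created, hands one out
# for each new file stripe, and asks for more when the pool runs low.
class PrecreatePool:
    def __init__(self, refill):
        self.refill = refill                 # callable that asks the OST for more objects
        self.objects = list(refill())

    def allocate(self):
        if not self.objects:
            self.objects = list(self.refill())   # Lustre does this replenishment asynchronously
        return self.objects.pop()

pool = PrecreatePool(lambda: range(100))     # pretend the OST pre-created objects 0..99
print(pool.allocate())                       # no extra RPC needed for this allocation
&lt;/pre&gt;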
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, nor extremely small I/O requests), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest)&lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
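To see why extents shrink this metadata, compare a per-block list with an extent list for the same contiguous file (a schematic example, not the on-disk ext3 format):&lt;br /&gt;
&lt;pre&gt;
# A 1 GB file written contiguously in 4 kB blocks:
file_size = 2**30
block_size = 4096
block_pointers = file_size // block_size     # 262144 individual block numbers to record
extents = [(123456, block_pointers)]         # a single (arbitrary start block, length) pair
print(block_pointers, len(extents))
&lt;/pre&gt;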
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
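In outline, the conflict handling described above looks something like the following sketch (invented names, just to make the sequence concrete):&lt;br /&gt;
&lt;pre&gt;
# What an OST lock server does when a new request conflicts with a held lock.
def resolve_conflict(holder_responds_in_time):
    # 1. send a blocking callback asking the holder to flush and drop the lock
    if holder_responds_in_time:
        # 2a. the holder wrote back its cached data and dropped the lock
        return &#039;granted after voluntary cancel&#039;
    # 2b. the holder missed the configurable timeout: evict it, then grant the lock
    return &#039;granted after eviction&#039;

print(resolve_conflict(True))
print(resolve_conflict(False))
&lt;/pre&gt;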
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent supported by the HBA drivers. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. Repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and beginning with 1.8, clients with 2.4 kernels will no longer be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why did you decide to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes were made to ext3, to make it more scalable and performant. Some extra symbols need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is a customer demand for a given kernel series, Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, are possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4218</id>
		<title>Lustre FAQ</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_FAQ&amp;diff=4218"/>
		<updated>2008-01-28T04:29:31Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Glossary ==&lt;br /&gt;
&lt;br /&gt;
* ACL: Access Control List&lt;br /&gt;
* DLM: Distributed Lock Manager&lt;br /&gt;
* EA: Extended Attribute&lt;br /&gt;
* FC: Fibrechannel&lt;br /&gt;
* HPC: High-Performance Computing&lt;br /&gt;
* IB: InfiniBand&lt;br /&gt;
* MDS: Metadata Server&lt;br /&gt;
* NAL: Network Abstraction Layer; a software module which provides support for a particular interconnect&lt;br /&gt;
* OSS: Object Storage Server&lt;br /&gt;
* OST: Object Storage Target (what&#039;s the difference? See section 2.3)&lt;br /&gt;
&lt;br /&gt;
== Fundamentals ==&lt;br /&gt;
=== Can you describe the data caching and cache coherency method?===&lt;br /&gt;
&lt;br /&gt;
There is complete cache coherence for both metadata (names, readdir lists, inode attributes) and file data. Clients and servers both take locks with a distributed lock management service; caches are flushed before locks are released.&lt;br /&gt;
=== Does Lustre separate metadata and file data?===&lt;br /&gt;
&lt;br /&gt;
Yes. The entire namespace is stored on Lustre Metadata Servers (MDSs); file data is stored on Lustre Object Storage Servers (OSSs).&lt;br /&gt;
&lt;br /&gt;
The data for each file may reside in multiple objects on separate servers. Lustre 1.x manages these objects in a RAID-0 (striping) configuration, so each object in a multi-object file contains only a part of the file&#039;s data. Future versions of Lustre will allow the user or administrator to choose other striping methods, such as RAID-1 or RAID-5 redundancy.&lt;br /&gt;
=== What is the difference between an OST and an OSS?===&lt;br /&gt;
&lt;br /&gt;
There is a lot of confusion, and it&#039;s mostly our fault; as the architecture evolved, we refined these terms, and it has been difficult to enforce.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Server (OSS) is a server node, running the Lustre software stack. It has one or more network interfaces, and usually one or more disks.&lt;br /&gt;
&lt;br /&gt;
An Object Storage Target (OST) is an interface to a single exported backend volume. It is conceptually similar to an NFS export, except that an OST does not contain a whole namespace, but rather file system objects.&lt;br /&gt;
&lt;br /&gt;
Is it common for a single OSS to export more than one OST? Yes, for example to get around the Linux 2.6 maximum 8 TB partition size. Although Lustre will aggregate multiple OSTs into a single large file system, the individual OST partitions are limited to 8 TB.&lt;br /&gt;
=== Does Lustre perform high-level I/O load balancing?===&lt;br /&gt;
&lt;br /&gt;
Yes. Because a single file can reside in pieces on many servers, the I/O load for even a single file can be distributed.&lt;br /&gt;
&lt;br /&gt;
By default, objects are randomly distributed amongst OSTs.&lt;br /&gt;
=== Is there a common synchronized namespace for files and directories?===&lt;br /&gt;
&lt;br /&gt;
Yes. All clients which mount the file system will see a single, coherent, synchronized namespace at all times.&lt;br /&gt;
=== Can Lustre be used as part of a &amp;quot;single system image&amp;quot; installation?===&lt;br /&gt;
&lt;br /&gt;
Yes. Lustre as the root file system is being used by some installations on both clients and servers, although it will not be officially supported until Lustre 1.6.x (see the [http://www.sun.com/software/products/lustre/specs.jsp roadmap] ).&lt;br /&gt;
&lt;br /&gt;
=== Do Lustre clients use NFS to reach the servers?===&lt;br /&gt;
&lt;br /&gt;
No. Client nodes run a native Lustre client file system driver, which uses the Lustre metadata and object protocols to communicate with the servers. The NFS protocol is not suitable to meet Lustre&#039;s metadata, I/O, locking, recovery, or performance requirements.&lt;br /&gt;
=== Does Lustre use/provide a single security domain?===&lt;br /&gt;
&lt;br /&gt;
Current versions of Lustre expect the clients and servers to have an identical understanding of UIDs and GIDs, but security is enforced on the Metadata Server by a server-local PAM-managed group database. Strong security and POSIX Access Control Lists are coming in future versions (see the roadmap ).&lt;br /&gt;
=== Does Lustre support the standard POSIX file system APIs?===&lt;br /&gt;
&lt;br /&gt;
Yes. Applications which use standard POSIX file system APIs can run on Lustre without modifications.&lt;br /&gt;
=== Is Lustre &amp;quot;POSIX compliant&amp;quot;? Are there any exceptions?===&lt;br /&gt;
&lt;br /&gt;
POSIX does not, strictly speaking, say anything about how a file system will operate on multiple clients. However, Lustre conforms to the most reasonable interpretation of what the single-node POSIX requirements would mean in a clustered environment.&lt;br /&gt;
&lt;br /&gt;
For example, the atomicity of read and write operations is enforced through the Lustre distributed lock manager; if application threads running on multiple nodes were to try to read and write the same part of a file at the same time, they would both see consistent results. This is true of all I/O and metadata operations, with two exceptions:&lt;br /&gt;
&lt;br /&gt;
* 1. atime updates&lt;br /&gt;
&lt;br /&gt;
It is not practical to maintain fully coherent atime updates in a high-performance cluster file system. Lustre will update the atime of files lazily -- if an inode needs to be changed on disk anyway, we will piggy-back an atime update if needed -- and when files are closed.&lt;br /&gt;
* 2. flock/lockf&lt;br /&gt;
&lt;br /&gt;
POSIX and BSD flock/lockf system calls will be completely coherent across the cluster, using the Lustre lock manager, but are not yet supported today. They will be soon (see the roadmap).&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.1, there was one additional deviation from POSIX, in the area of mmap I/O. In 1.4.1, changes were made to support cache-coherent mmap I/O and robust execution of binaries and shared libraries residing in Lustre. mmap() I/O is now coherent and synchronized via the Lustre lock manager, although there may be pathological cases that remain hazardous for some time.&lt;br /&gt;
=== Can you grow/shrink file systems online?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.6 contains support for online addition of OST targets either on a new or on an existing OSS.  In an upcoming version of Lustre the recently added support for online resizing of ext3 volumes will provide an additional way of growing file systems.  Shrinking is not supported.&lt;br /&gt;
=== Which disk file systems are supported as Lustre backend file systems?===&lt;br /&gt;
&lt;br /&gt;
Lustre includes a patched version of the ext3 file system, with additional features such as extents, an efficient multi-block allocator, htree directories, large inodes, extended attributes, transaction optimizations, fine-grained locking, and CPU affinity for critical operations. This is the only supported backend file system today.&lt;br /&gt;
=== Why did Lustre choose ext3? Do you ever plan to support others?===&lt;br /&gt;
&lt;br /&gt;
There are many reasons to choose ext3. One is size; at under 15,000 lines of code, it is extremely understandable, maintainable, and modifiable. Another is reliability; ext3 is proven stable by millions of users, with an excellent file system repair tool.&lt;br /&gt;
&lt;br /&gt;
When we began, there was a big difference between the various Linux file systems, particularly with respect to performance. In the last three years, however, Lustre has carried ext3 very substantially forward, and it is now extremely competitive with other Linux file systems.&lt;br /&gt;
&lt;br /&gt;
Nevertheless, we had originally planned to support multiple file systems, so Lustre does contain a file system abstraction. The Solaris Port of the OSS will support ZFS.&lt;br /&gt;
&lt;br /&gt;
=== Why didn&#039;t you use IBM&#039;s distributed lock manager?===&lt;br /&gt;
&lt;br /&gt;
The design of the Lustre DLM borrows heavily from the VAX Clusters DLM, plus extensions that are not found in others. Although we have received some reasonable criticism for not using an existing package (such as IBM&#039;s DLM), experience thus far has seemed to indicate that we&#039;ve made the correct choice: it&#039;s smaller, simpler and, at least for our needs, more extensible.&lt;br /&gt;
&lt;br /&gt;
The Lustre DLM, at around 6,000 lines of code, has proven to be a manageable maintenance task, despite its somewhat daunting complexity. The IBM DLM, by comparison, was nearly the size of all of Lustre combined. This is not necessarily a criticism of the IBM DLM, however; to its credit, it is a complete DLM which implements many features which we do not require in Lustre.&lt;br /&gt;
&lt;br /&gt;
In particular, Lustre&#039;s DLM is not really distributed, at least not when compared to other such systems. Locks in the Lustre DLM are always managed by the service node, and do not change masters as other systems allow. Omitting features of this type has allowed us to rapidly develop and stabilize the core DLM functionality required by the file system, plus add several extensions of our own (extent locking, intent locking, policy functions, glimpse ASTs, and a different take on lock value blocks).&lt;br /&gt;
=== Are services at user or kernel level? How do they communicate?===&lt;br /&gt;
&lt;br /&gt;
All daemons on a single node run in the kernel, and therefore share a single address space. Daemons on different nodes communicate through RPC messages; large messages are sent using remote DMA if the fabric supports it.&lt;br /&gt;
&lt;br /&gt;
== Sizing ==&lt;br /&gt;
&lt;br /&gt;
=== What is the maximum file system size? What is the largest file system you&#039;ve tested? ===&lt;br /&gt;
&lt;br /&gt;
Each backend OST file system is restricted to a maximum of 2 TB on Linux 2.4 (imposed by the kernel block device), or 8 TB on Linux 2.6 (imposed by ext3). Of course, it is possible to have multiple OST file systems on a single OSS, and to aggregate multiple OSSs.  Running with almost 4,000 OSTs has been tried, hence 32 PB file systems can be achieved today.&lt;br /&gt;
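The arithmetic behind that figure, as a quick illustrative check:&lt;br /&gt;
&lt;pre&gt;
# Aggregate capacity with the Linux 2.6 per-OST limit quoted above.
osts = 4000
tb_per_ost = 8
print(osts * tb_per_ost)    # 32000 TB, i.e. 32 PB
&lt;/pre&gt;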
&lt;br /&gt;
Lustre users already run single production filesystems of 1.4PB.&lt;br /&gt;
=== What is the maximum file system block size? ===&lt;br /&gt;
&lt;br /&gt;
The basic ext3 block size is 4096 bytes, although this could in principle be easily changed to a large PAGE_SIZE (on IA64, for example) with a few modifications to ext3. It is not clear, however, that this is necessary.&lt;br /&gt;
&lt;br /&gt;
Some people confuse block size with extent size or I/O request size -- they are not the same thing. The block size is the basic unit of disk allocation, and for our purposes it seems that 4kB is as good as any. The size of a single file extent, by definition, is almost always larger than 4kB, and ext3 improvements in Lustre 1.4.x do a good job of using contiguous disk blocks whenever possible. The I/O request size (the amount of data that we try to read or write in a single request) is much much larger.&lt;br /&gt;
=== What is the maximum single-file size? ===&lt;br /&gt;
&lt;br /&gt;
On 32-bit clients, the page cache makes it quite difficult to read or write a single file larger than 8 TB.  On 64-bit clusters, the maximum file size is 2^64 bytes.  A current Lustre limit for allocated file space arises from a maximum of 160 stripes and about 8 TB per stripe, leading to about 1.28 PB per file.&lt;br /&gt;
=== What is the maximum number of files in a single file system? In a single directory? ===&lt;br /&gt;
&lt;br /&gt;
We use the ext3 hashed directory code, which has a theoretical limit of ~134 million files per directory, at which point the directory grows to more than 2 GB. The maximum number of subdirectories is 32,000 in versions prior to Lustre 1.2.6 and is unlimited in later versions (small ext3 format change).&lt;br /&gt;
&lt;br /&gt;
More realistically, we regularly run tests with ten million files in a single directory. On a properly-configured dual-CPU MDS with 4 GB of ram, it is possible to do random lookups in this directory at a rate of 5,000/second.&lt;br /&gt;
&lt;br /&gt;
A single MDS imposes an upper limit of 4 billion inodes, but the default limit is slightly less than the device size / 4kB, so about 512 million inodes for a 2 TB MDS file system. This can be increased at initial file system creation time by specifying mkfs options. We regularly test with file systems containing approximately 100 million files.&lt;br /&gt;
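&lt;br /&gt;
As a hedged illustration of those mkfs options (the Lustre configuration tools normally pass such options for you; the device name and values below are placeholders), the inode count of the backing ext3 file system is controlled with standard mke2fs parameters:&lt;br /&gt;
&lt;pre&gt;
# Hypothetical example: reserve roughly one inode per 2 kB of MDS space
# instead of the default, raising the file limit for a 2 TB MDS device.
mke2fs -j -i 2048 /dev/sdb1

# Or request an explicit inode count directly:
mke2fs -j -N 500000000 /dev/sdb1
&lt;/pre&gt;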
&lt;br /&gt;
With the introduction of clustered metadata servers (Lustre 2.0), these limits will disappear.&lt;br /&gt;
=== How many OSSs do I need? ===&lt;br /&gt;
&lt;br /&gt;
The short answer is: as many as you need to achieve the required aggregate I/O throughput.&lt;br /&gt;
&lt;br /&gt;
The long answer is: Each OSS contributes to the total capacity and the aggregate throughput. For example, a 100 TB file system may use 100 single-gige-connected OSS nodes with 1 TB of 100 MB/sec storage each, providing 10 GB/sec of aggregate bandwidth. The same bandwidth and capacity could be provided with four heavy-duty 25 TB OSS servers with three Elan 4 interfaces and 16 FC2 channels, each providing ~2.5 GB/s in aggregate bandwidth. The 25 TB of storage must be capable of 2.5 GB/s.&lt;br /&gt;
&lt;br /&gt;
Each OSS can support a very large number of clients, so we do not advise our customers to use any particular client-to-OSS ratio. Nevertheless, it is common to deploy 1 GB/s of OSS throughput per 1 TFLOP/s of compute power.&lt;br /&gt;
=== What is the largest possible I/O request? ===&lt;br /&gt;
&lt;br /&gt;
When most people ask this question, they are asking what is the maximum buffer size that can be safely passed to a read() or write() system call. In principle this is limited only by the address space on the client, although in practice it is possible to request so much I/O in a single atomic unit that the cluster infrastructure will see this as non-responsiveness and generate timeouts. Depending on your I/O infrastructure, this limit can be 100 MB or more, so it has not been an issue in practice. In any case, we are aware of this limitation, and will work to remove it in a future release.&lt;br /&gt;
&lt;br /&gt;
The Lustre I/O subsystem is designed with the understanding that an I/O request travels through many pipelines, and that it&#039;s important to keep all pipelines full for maximum performance. So it is not necessary to teach your application to do I/O in very large chunks; Lustre and the page cache will aggregate I/O for you.&lt;br /&gt;
&lt;br /&gt;
Typically, Lustre client nodes will do their best to aggregate I/O into 1 MB chunks on the wire, and to keep between 5 and 10 I/O requests &amp;quot;in flight&amp;quot; at a time, per server.&lt;br /&gt;
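&lt;br /&gt;
If you want to observe or tune this pipelining on a client, there are per-OSC controls under /proc; the exact paths below are an assumption for 1.4.x/1.6.x-era clients and may differ between releases:&lt;br /&gt;
&lt;pre&gt;
# Assumed /proc paths; check your release before relying on them.
# RPCs kept in flight per OST by this client:
cat /proc/fs/lustre/osc/*/max_rpcs_in_flight

# Dirty (unwritten) cache allowed per OST, in MB:
cat /proc/fs/lustre/osc/*/max_dirty_mb

# Raise the in-flight RPC count for all OSCs on this client:
for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do echo 16 &gt; $f; done
&lt;/pre&gt;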
&lt;br /&gt;
On the OSS, we have gone to significant effort to ensure that these large 1 MB buffers do not get unnecessarily broken up by lower kernel layers. In Linux 2.4, modifications were required to the SCSI layer, block devices, and the QLogic fibrechannel driver. Because nearly 100% of our customers use QLogic fibrechannel cards, we have not yet optimized other drivers in this way.&lt;br /&gt;
=== How many nodes can connect to a single Lustre file system? ===&lt;br /&gt;
&lt;br /&gt;
The largest single production Lustre installation is approximately 25,000 nodes today (2006). A Cray XT3 cluster of almost 12,000 nodes is currently being deployed, although it will likely run in production as two separate 6,000-node clusters.&lt;br /&gt;
&lt;br /&gt;
Although these are the largest clusters available to us today, we believe that the architecture is fundamentally capable of supporting many tens of thousands of clients.&lt;br /&gt;
&lt;br /&gt;
== Installation ==&lt;br /&gt;
===Which operating systems are supported as clients and servers?===&lt;br /&gt;
&lt;br /&gt;
Please see OS Support.&lt;br /&gt;
===Can you use NFS or CIFS to reach a Lustre volume?===&lt;br /&gt;
&lt;br /&gt;
Yes. Any native Lustre client (running Linux today, by definition) can export a volume using Samba (or NFS, but keep reading). Some people have even built small clusters of these export nodes, to improve overall performance to their non-native clients.&lt;br /&gt;
&lt;br /&gt;
Although NFS export works today, we don&#039;t support it because it is extremely slow. Lustre will require some specific optimizations to work around the behaviour of the Linux kernel NFS server. We&#039;re working on these, but in the meantime, we suggest Samba.&lt;br /&gt;
&lt;br /&gt;
CIFS export with Samba, even a cluster of such nodes, is possible with one caveat: oplocks and Windows share modes. If you connect to these Samba shares with Windows clients, they will probably make heavy use of share modes and oplocks for locking and synchronization. Samba implements them internally, and does not yet have a clustered mode to coordinate them between multiple servers running on separate nodes. So if you rely on the consistency of these share modes and oplocks, you should use a single node to export CIFS.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
1,000-node clusters with moderate metadata loads are commonly supported with a dual-Xeon node with 2GB of RAM, providing sustained throughput of over 1,000 ops/second. It is common for these systems to have roughly 20 million files. Even in 1,000-client clusters, the single MDS has been shown not to be a bottleneck under typical loads.&lt;br /&gt;
&lt;br /&gt;
High throughput with very large directories is possible with 64-bit architectures and extra RAM. Typically, the write I/O load is low but seek latency is very important, hence RAID-1 mirrored storage is recommended.&lt;br /&gt;
&lt;br /&gt;
Storage requirements should be sized at approximately 4kB per file, except in unusual circumstances.&lt;br /&gt;
===What is the typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
IA32 systems support up to 150 MB/s over dual-GigE, and DMA-capable networks are typically limited only by the bus. 64-bit servers with good buses are capable of saturating multiple interconnects of any type. These nodes are often dual-CPU and support up to 4 fibrechannel channels. RAM is used for locks and read caching, but large amounts of RAM are usually not necessary, even for 1,000-node clusters.&lt;br /&gt;
===Which architectures are interoperable?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.4 requires the page size on server nodes (MDS and OSS) to be smaller than or the same size as on client nodes. Except for this, there are no known obstacles to interoperability, even among heterogeneous client groups and systems with different endianness.&lt;br /&gt;
&lt;br /&gt;
===Which storage devices are supported, on MDS and OSS nodes?===&lt;br /&gt;
&lt;br /&gt;
Servers support all block storage: fibrechannel, SCSI, SATA, ATA, and exotic devices such as NVRAM. &lt;br /&gt;
===Which storage interconnects are supported?===&lt;br /&gt;
&lt;br /&gt;
Just to be clear: Lustre does not require a SAN, nor does it require a fabric like iSCSI. It will work just fine over simple IDE block devices. But because many people already have SANs, or want some amount of shared storage for failover, this is a common question.&lt;br /&gt;
&lt;br /&gt;
For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other block storage protocol can be used. Failover functionality requires shared storage (each partition used active/passive) between a pair of nodes on a fabric like SCSI, FC or SATA.&lt;br /&gt;
===Are fibrechannel switches necessary? How does HA shared storage work?===&lt;br /&gt;
&lt;br /&gt;
Typically, fibrechannel switches are not necessary. Multi-port shared storage for failover is normally configured to be shared between two server nodes on a FC-AL. Shared SCSI and future shared SATA devices will also work.&lt;br /&gt;
&lt;br /&gt;
Backend storage is expected to be cache-coherent between multiple channels reaching the devices. Servers in an OSS failover pair are normally both active in the file system, and will take over partitions for each other in the case of a failure. MDS failover pairs can also both be active, but only if they serve multiple separate file systems.&lt;br /&gt;
===Can you put the file system journal on a separate device?===&lt;br /&gt;
&lt;br /&gt;
Yes. This can be configured when the backend ext3 file systems are created.&lt;br /&gt;
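&lt;br /&gt;
For example, with the plain ext3 tools (the Lustre formatting utilities would normally wrap these steps; the device names are placeholders), an external journal looks roughly like this:&lt;br /&gt;
&lt;pre&gt;
# Create a dedicated journal device:
mke2fs -O journal_dev /dev/sdc1

# Create the backing ext3 file system with its journal on that device:
mke2fs -j -J device=/dev/sdc1 /dev/sdb1
&lt;/pre&gt;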
===Can you run Lustre on LVM volumes, software RAID, etc?===&lt;br /&gt;
&lt;br /&gt;
Yes. You can use any Linux block device as storage for a backend Lustre server file system, including LVM or software RAID devices.&lt;br /&gt;
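&lt;br /&gt;
A minimal sketch with standard Linux tools (placeholder device and volume names) of assembling such a device before formatting it for Lustre:&lt;br /&gt;
&lt;pre&gt;
# Software RAID-5 across four disks:
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# Or an LVM logical volume:
pvcreate /dev/sdb1 /dev/sdc1
vgcreate ostvg /dev/sdb1 /dev/sdc1
lvcreate -L 500G -n ost0 ostvg

# /dev/md0 or /dev/ostvg/ost0 can then be used as OST (or MDS) storage.
&lt;/pre&gt;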
===Can you describe the installation process?===&lt;br /&gt;
&lt;br /&gt;
The current installation process is straightforward, but manual:&lt;br /&gt;
&lt;br /&gt;
1. Install the provided kernel and Lustre RPMs.&lt;br /&gt;
2. A configuration tool assistant can generate a configuration file for simple configurations, or you can build more complex configurations with relatively simple shell scripts.&lt;br /&gt;
3. Format and start the object servers and metadata servers. The command is usually identical on all nodes, so it&#039;s easy to use a utility like pdsh/prun to execute it.&lt;br /&gt;
4. Start the clients with &amp;quot;mount&amp;quot;, similar to how NFS is mounted. &lt;br /&gt;
&lt;br /&gt;
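To make step 4 concrete: the client mount resembles an NFS mount, but the exact device syntax differs between Lustre releases, so the lines below (1.6-style syntax with placeholder names) are an illustration only:&lt;br /&gt;
&lt;pre&gt;
# Mount the file system on one client (placeholder MGS node and fsname):
mount -t lustre mgsnode@tcp0:/testfs /mnt/lustre

# Or start it on many clients at once with pdsh:
pdsh -w client[001-100] mount -t lustre mgsnode@tcp0:/testfs /mnt/lustre
&lt;/pre&gt;
&lt;br /&gt;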
We are in the process of building new tools for our enterprise customers. New un-configured servers will announce themselves to the management node for template/profile-based addition to the cluster. The first such tools will appear in 2005.&lt;br /&gt;
===What is the estimated installation time per compute node?===&lt;br /&gt;
&lt;br /&gt;
Assuming that the node doesn&#039;t require special drivers or kernel configuration, 5 minutes. Compute nodes can be installed and started in parallel.&lt;br /&gt;
===What is the estimated installation time per I/O node?===&lt;br /&gt;
&lt;br /&gt;
5 minutes, plus formatting time, which can also be done in parallel.&lt;br /&gt;
&lt;br /&gt;
== Networking ==&lt;br /&gt;
&lt;br /&gt;
===Which interconnects and protocols are currently supported?===&lt;br /&gt;
Today, Lustre supports TCP/IP (commonly over gigabit or 10-gigabit ethernet), Quadrics Elan 3 and 4, OpenIB generation 1, Voltaire IB (3.4.5+), OFED (1.1), Topspin (3.2.0), GM (Myrinet GM, 2.1.22+), Cisco, and Cray&#039;s RapidArray and SeaStar networks.&lt;br /&gt;
&lt;br /&gt;
===Can I use more than one interface of the same type on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.6 and later.&lt;br /&gt;
===Can I use two or more different interconnects on the same node?===&lt;br /&gt;
&lt;br /&gt;
Yes, with Lustre 1.4.x, subject to the particular limitations of the interconnect. For example, we are told that it is not possible to use both Elan 3 and Elan 4 in the same node at the same time.&lt;br /&gt;
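&lt;br /&gt;
On LNET-based releases (Lustre 1.4.6 and later), the networks a node uses are selected with a module option; the interface names below are placeholders:&lt;br /&gt;
&lt;pre&gt;
# /etc/modprobe.conf fragment: use TCP on eth0 and a Quadrics Elan rail
# on the same node (placeholder interface names).
options lnet networks=tcp0(eth0),elan0
&lt;/pre&gt;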
===Can I use TCP offload cards?===&lt;br /&gt;
&lt;br /&gt;
Probably -- but we&#039;ve tried many of these cards, and for various reasons we didn&#039;t see much improvement, if any. First, because Lustre runs entirely in the kernel, it uses kernel networking APIs which are often not supported (or at least not optimized) by the offload drivers.&lt;br /&gt;
&lt;br /&gt;
Second, the problem isn&#039;t the overhead of checksum calculation or the need for interrupt coalescing; lots of commodity ethernet cards already support these features. The big overhead is memory copying and buffering, which these cards rarely do anything to address.&lt;br /&gt;
===Does Lustre support crazy heterogeneous network topologies?===&lt;br /&gt;
&lt;br /&gt;
Yes, although the craziest of them are not yet fully supported.&lt;br /&gt;
&lt;br /&gt;
Because Lustre supports native protocols on top of high speed cluster interconnects (in addition to TCP/IP), some special infrastructure is necessary.&lt;br /&gt;
&lt;br /&gt;
Lustre uses its own implementation of the Portals message passing API, upon which we have implemented Gateway nodes, to route between two native protocols. These are commodity nodes with, for example, both gigabit ethernet and Quadrics Elan interfaces. The gateway software translates the Portals packets between the interfaces to bridge the two networks.&lt;br /&gt;
&lt;br /&gt;
These routers are in use today, and may become more popular as more enterprises connect multiple clusters with special interconnects to a single global Lustre file system. On the other hand, TCP/IP on gig-e is the interconnect of choice for most organizations, which requires no additional Portals routing.&lt;br /&gt;
&lt;br /&gt;
== Metadata Servers ==&lt;br /&gt;
&lt;br /&gt;
===How many metadata servers does Lustre support?===&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x supports up to two metadata servers (MDSs) per file system in an active/passive failover configuration, meaning that only one server is actually servicing requests at a time.&lt;br /&gt;
&lt;br /&gt;
You can configure multiple active metadata servers today, but they must each serve separate file systems.&lt;br /&gt;
&lt;br /&gt;
Lustre 2.x will introduce the clustered metadata feature, which will permit dozens or hundreds of metadata servers working in parallel for a single file system.&lt;br /&gt;
===How will clustered metadata work?===&lt;br /&gt;
&lt;br /&gt;
At a high level, it is reasonably simple: each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. When a client wants to lookup or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.&lt;br /&gt;
&lt;br /&gt;
When you consider the details of doing this efficiently, coherently, and completely recoverably in the face of any number of different failures, it becomes more complicated.&lt;br /&gt;
&lt;br /&gt;
We have already demonstrated very substantial portions of this functionality, including recovery, as a part of our DoE/NNSA PathForward effort. It will not be production-quality, however, for some time.&lt;br /&gt;
===Isn&#039;t the single metadata server a bottleneck?===&lt;br /&gt;
&lt;br /&gt;
Not so far. We regularly perform tests with single directories containing millions of files, and we have many customers with 1,000-node clusters and a single metadata server.&lt;br /&gt;
&lt;br /&gt;
Lustre is carefully designed to place the entire burden of file I/O on the Object Storage Servers (OSSs): locking, disk allocation, storage and retrieval, everything. Once the file is opened and the striping information obtained by the client, the metadata server is no longer involved in the business of file I/O.&lt;br /&gt;
&lt;br /&gt;
The Lustre metadata server software is extremely multithreaded, and we have made substantial modifications to ext3 and the Linux VFS (2.6) to enable fine-grained locking of a single directory. These are the kinds of optimizations which make it possible to do random creations or lookups in a single 10-million-file directory at a rate of more than 5,000 per second.&lt;br /&gt;
&lt;br /&gt;
If there is a customer need for massive metadata servers prior to the release of clustered metadata, it should be possible to scale quite far using large SMP systems (such as the SGI Altix or Bull NovaScale) with large memory spaces which can utilize large caches to speed MDS operations.&lt;br /&gt;
===What is the typical MDS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
=== How do I automate failover of my MDSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support MDS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
No. The extreme complexity that this would introduce does not seem to be warranted by the relatively modest cost of a small amount of shared storage for the metadata servers.&lt;br /&gt;
===How is metadata allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
The standard way that Lustre formats the MDS file system is with 256-byte ext3 inodes, which contain extended attributes (EAs) embedded in the inodes. One use of such an EA is for the file striping data, which tells the clients on which object servers to find the file data. For very widely striped files, this EA may be too large to store in the inode and will be stored in separate blocks. By storing the EA in the inode when possible, we avoid an extra very expensive disk seek.&lt;br /&gt;
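&lt;br /&gt;
As an illustration of that format choice only (the Lustre tools select the inode size automatically; the device name is a placeholder), the inode size is an ordinary mke2fs parameter:&lt;br /&gt;
&lt;pre&gt;
# Format an MDS backing file system with 256-byte inodes so the striping
# EA normally fits inside the inode itself, avoiding an extra seek.
mke2fs -j -I 256 /dev/sdb1
&lt;/pre&gt;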
=== What is intent locking?===&lt;br /&gt;
&lt;br /&gt;
Most file systems operate in one of two modes: a mode in which the server does the metadata modifications, or a mode in which the client can cache metadata updates itself. Both ways have their advantages and disadvantages in certain situations.&lt;br /&gt;
&lt;br /&gt;
Consider the case of 1,000 clients all chdir&#039;ing to /tmp and creating their own output files. If each client locks the directory, adds their file, uploads the modification, and releases the lock, this simple operation will take forever. If the metadata server is able to execute the operations locally and return the results, it should all happen in less than a second.&lt;br /&gt;
&lt;br /&gt;
This is not a contrived example -- Lustre users run applications which do this very thing every hour of every day, for example, to write checkpoint data of a long-running scientific computation.&lt;br /&gt;
&lt;br /&gt;
Consider another very common case, like a user&#039;s home directory being used only by one node. In this case, it would be extremely advantageous to allow that node to cache metadata updates in ram, then lazily propagate them back to the MDS. This allows the user to make updates as fast as they can be recorded in ram (until ram is full).&lt;br /&gt;
&lt;br /&gt;
The moral of the story is: in cases of high concurrency, do the updates on the metadata server. In single-user cases, cache updates on the client.&lt;br /&gt;
&lt;br /&gt;
What does this have to do with intent locking? Our protocol bundles up the information for the entire operation with the initial lock request, in the form of a metadata intent, and gives the metadata server the option of whether to execute the operation immediately (and return only a result code), or to return a lock to allow the client to perform writeback caching.&lt;br /&gt;
&lt;br /&gt;
Lustre 1.x does not include a metadata writeback cache on the client, so today&#039;s metadata server always executes the operation on the client&#039;s behalf. Even without a writeback cache, however, the intent locking infrastructure still provides value. By having all of the information available during the initial lock request, we are able to perform all metadata operations in a single RPC.&lt;br /&gt;
&lt;br /&gt;
=== How does the metadata locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Prior to Lustre 1.4.x, each metadata server inode was locked as a single unit. When the client wished to cache the existence of a name, or the attributes or directory pages of an inode, it would take and hold such a read lock. When the metadata server modified an inode, it would take a write lock on the same.&lt;br /&gt;
&lt;br /&gt;
There are common cases when even one lock per inode is not enough, however. For example, consider the case of creating the file &amp;quot;dir/file&amp;quot;, as would happen during the unpacking of a tar archive. The client would first lookup &amp;quot;dir&amp;quot;, and in doing so take a lock to cache the result. It would then ask the metadata server to create &amp;quot;file&amp;quot; inside of it, which would lock &amp;quot;dir&amp;quot; to modify it, thus yanking the lock back from the client. This &amp;quot;ping-pong&amp;quot; back and forth is unnecessary and very inefficient, so a Lustre 1.4.x release will introduce the separate locking of different parts of the inode (simple existence, directory pages, and attributes).&lt;br /&gt;
=== Does the MDS do any pre-allocation?===&lt;br /&gt;
&lt;br /&gt;
Yes. To enable very fast file creation, the metadata server asks the individual OSTs to pre-create some number of objects, which the MDS can then allocate as file stripes without additional RPCs. These preallocations are replenished asynchronously.&lt;br /&gt;
&lt;br /&gt;
== Object Servers and I/O Throughput ==&lt;br /&gt;
===What levels of throughput should I expect?===&lt;br /&gt;
&lt;br /&gt;
This of course depends on many variables, including the type and number of clients and servers, your network and disk infrastructure, your application&#039;s I/O patterns, tuning, and more. With standard HPC workloads and reasonable I/O patterns (i.e., not seek-bound, not extremely small I/O requests, and so on), Lustre has demonstrated up to 90% of the system&#039;s raw I/O bandwidth capability.&lt;br /&gt;
&lt;br /&gt;
With all of those variables in mind, here are some demonstrated single-server results on customer or demonstration installations of various types:&lt;br /&gt;
&lt;br /&gt;
* TCP/IP&lt;br /&gt;
**Single-connected gig-e: 115 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 32-bit OSS: 180 MB/s&lt;br /&gt;
**Dual-NIC gig-e on a 64-bit OSS: 220 MB/s&lt;br /&gt;
**Single-connected 10 gig-e on a 64-bit OSS: 550 MB/s (1 GB/s on Woodcrest) &lt;br /&gt;
* Quadrics Elan 4&lt;br /&gt;
**Single-rail Elan 4 on a 64-bit OSS: 900 MB/s&lt;br /&gt;
**Triple-rail Elan 4 on an 8-way IA-64 OSS: 2600 MB/s &lt;br /&gt;
* Unoptimized InfiniBand&lt;br /&gt;
**Single-port Infiniband on a 64-bit OSS: 700-900 MB/s &lt;br /&gt;
&lt;br /&gt;
===How fast can a single OSS be?===&lt;br /&gt;
&lt;br /&gt;
Using Lustre 1.4.0, a single 8-way Bull NovaScale IA-64 OSS, DataDirect Networks storage, and 3 rails of Quadrics Elan 4, a single OSS achieved 2.6 GB/s of sustained end-to-end bandwidth from two 16-way IA-64 client nodes.&lt;br /&gt;
&lt;br /&gt;
Also using Lustre 1.4.0, a single-CPU AMD Opteron using 10-gigabit ethernet has been clocked at 550 MB/s.&lt;br /&gt;
===How well does Lustre scale as OSSs are added?===&lt;br /&gt;
&lt;br /&gt;
Configured properly, it will scale linearly. In demonstrations on a production system of up to 104 Lustre OSSs, each connected with a single gigabit ethernet port, the aggregate sustained bandwidth reached 11.1 GB/s.&lt;br /&gt;
===How many clients can each OSS support?===&lt;br /&gt;
&lt;br /&gt;
The number of clients is not usually a factor in choosing how many OSSs to deploy. Please see Sizing.&lt;br /&gt;
===What is a typical OSS node configuration?===&lt;br /&gt;
&lt;br /&gt;
Please see Installation.&lt;br /&gt;
===How do I automate failover of my OSSs?===&lt;br /&gt;
&lt;br /&gt;
Please see Recovery.&lt;br /&gt;
===Do you plan to support OSS failover without shared storage?===&lt;br /&gt;
&lt;br /&gt;
Yes. On the roadmap, these features are the RAID-1 and RAID-5 file I/O features. They will provide redundancy and recoverability in the Lustre object protocol rather than requiring shared storage.&lt;br /&gt;
===How is file data allocated on disk?===&lt;br /&gt;
&lt;br /&gt;
Because the Lustre OSTs mount regular ext3 file systems, you can mount them directly and look at them. If you were to do so, you would see a lot of files with names like &amp;quot;934151&amp;quot;, which are object numbers. Inside each object is a file&#039;s data, or a portion of that file&#039;s data, depending on the striping policy for that file. There is no namespace information stored on the object server at this time.&lt;br /&gt;
&lt;br /&gt;
The allocation of this file data to disk blocks is governed by ext3, although here we have made very substantial improvements. Instead of a long array of individual blocks, Lustre&#039;s ext3 manages file data extents, which can dramatically reduce the amount of this metadata for each file, and therefore the amount of seeking and I/O required to read and write it. We also implemented a new buddy block allocator, which can very quickly and without a lot of searching return very large contiguous disk extents.&lt;br /&gt;
===How does the object locking protocol work?===&lt;br /&gt;
&lt;br /&gt;
Before any file data can be modified, or stored in a client cache, a lock must be taken. Each OST runs a lock server, and manages the locking for the stripes of data which reside on that OST. This has two extremely positive effects:&lt;br /&gt;
&lt;br /&gt;
First, it removes a potential bottleneck of a single lock server. As you add object servers, you also add lock server capacity, in addition to disk capacity and bandwidth, and network bandwidth.&lt;br /&gt;
&lt;br /&gt;
Second, it removes the so-called &amp;quot;split-brain&amp;quot; problem common in clustered systems. If the lock service and I/O service reside on different nodes, it is possible for the communications between them to be disrupted, while clients can still access one or both. In that case, data corruption could result because the locking and I/O would no longer be carefully coordinated.&lt;br /&gt;
&lt;br /&gt;
In the Lustre protocol, if a client requests a lock which conflicts with a lock held by another client, a message is sent to the lock holder asking for the lock to be dropped. Before that lock is dropped, the client must write back any cached modifications, and remove all data from its cache for which it will no longer have a lock. Then, and only then, can it drop the lock.&lt;br /&gt;
&lt;br /&gt;
If a client does not drop its lock in a reasonable amount of time (defined by a configurable timeout value) -- perhaps because it has been powered off, or suffered a hardware failure, or for some other reason -- it is evicted from that OST and will not be allowed to execute any operations until it has reconnected. This allows the remainder of the cluster to continue after a node has failed, after a short pause.&lt;br /&gt;
&lt;br /&gt;
Finally, we have implemented a lock manager extension to optimize the very common case of sampling an object&#039;s attributes while it is being modified. Many users, for example, will want to track the progress of a job by getting a file listing (&amp;quot;ls -l&amp;quot;) in the output directory while the job is writing its data.&lt;br /&gt;
&lt;br /&gt;
Because it is not acceptable to return stale or out-of-date file size information, we must ask the server for a lock on this data. Because we don&#039;t actually need the data -- we just need to know how much there is -- we tell the server that instead of a lock it could simply provide the attributes. This is another case of intent locking. If the file is not being actively modified, then the server will grant a lock so that the client can cache the attributes.&lt;br /&gt;
===Does Lustre support Direct I/O?===&lt;br /&gt;
&lt;br /&gt;
Yes. It locks the data to guarantee cluster-wide consistency, just like normal POSIX I/O.&lt;br /&gt;
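&lt;br /&gt;
A quick way to exercise direct I/O from the shell is GNU dd&#039;s direct flags (assuming a dd new enough to support iflag/oflag=direct; the path is a placeholder):&lt;br /&gt;
&lt;pre&gt;
# Write 1 GB with O_DIRECT, bypassing the client page cache:
dd if=/dev/zero of=/mnt/lustre/directio.test bs=1M count=1024 oflag=direct

# Read it back, also with O_DIRECT:
dd if=/mnt/lustre/directio.test of=/dev/null bs=1M iflag=direct
&lt;/pre&gt;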
===Can these locks be disabled?===&lt;br /&gt;
&lt;br /&gt;
Yes, but:&lt;br /&gt;
&lt;br /&gt;
* It&#039;s only safe to do so when you use direct I/O; otherwise you have data in the caches which is not locked. Once that data is in the cache without a lock, it will not be removed except under memory pressure.&lt;br /&gt;
* In practice, the overhead of these locks has not been shown to be an issue. Databases may or may not be an exception, but in any case, they tend to use direct I/O.&lt;br /&gt;
&lt;br /&gt;
===Do you plan to support T-10 object devices?===&lt;br /&gt;
&lt;br /&gt;
We are in touch with the T-10 committee. It is not clear to us that recovery and lock management implications for cluster file systems will see sufficient attention in the T-10 standard for this proposal to be viable. The goals of the T-10 committee may not, in the end, line up well with the very strong semantic guarantees that Lustre makes.&lt;br /&gt;
===Does Lustre support/require special parallel I/O libraries?===&lt;br /&gt;
&lt;br /&gt;
Lustre supports them, but by no means requires them. We have found equal performance when using standard POSIX I/O calls, the POSIX ADIO driver, or the MPI/IO libraries.&lt;br /&gt;
&lt;br /&gt;
The only useful bit that we will likely support is the MPI/IO extension to allow an application to provide hints about how it would like its output files to be striped.&lt;br /&gt;
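&lt;br /&gt;
Independently of MPI/IO, striping can also be set per file with the lfs utility; the option syntax has changed between releases, so the positional 1.4-style form below should be checked against your version (file name and values are placeholders):&lt;br /&gt;
&lt;pre&gt;
# Create an empty file striped over 4 OSTs with a 1 MB stripe size,
# letting the system pick the starting OST (-1):
lfs setstripe /mnt/lustre/output.dat 1048576 -1 4

# Inspect the resulting layout:
lfs getstripe /mnt/lustre/output.dat
&lt;/pre&gt;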
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
&lt;br /&gt;
=== How do I configure failover services?===&lt;br /&gt;
&lt;br /&gt;
Typical failover configurations couple two Lustre MDS or OSS nodes in pairs directly to a multi-port disk array. Object servers are typically active/active, with each serving half of the array, while metadata servers must be active/passive. These array devices typically have redundancy internally, to eliminate them as single points of failure.&lt;br /&gt;
&lt;br /&gt;
This does not typically require a fibrechannel switch.&lt;br /&gt;
=== How do I automate failover of my MDSs/OSSs?===&lt;br /&gt;
&lt;br /&gt;
The actual business of automating the decisions about whether a server has failed, and which server should take over the load, is managed by a separate package (our customers have used Red Hat&#039;s Cluster Manager and SuSE&#039;s Heartbeat).&lt;br /&gt;
&lt;br /&gt;
Completely automated failover also requires some kind of programmatically controllable power switch, because the new &amp;quot;active&amp;quot; MDS must be able to completely power off the failed node. Otherwise, there is a chance that the &amp;quot;dead&amp;quot; node could wake up, start using the disk at the same time, and cause massive corruption.&lt;br /&gt;
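&lt;br /&gt;
A rough sketch of what such a framework does when an MDS fails; the power-control command is hypothetical (it stands in for whatever STONITH device or managed power switch you have), and the server start shown is the 1.6-style mount (earlier releases start services with lconf):&lt;br /&gt;
&lt;pre&gt;
# Fence the failed node first so it cannot touch the shared disk again.
powerctl --off failed-mds        # hypothetical power-switch command

# Then take over the shared MDS device on the surviving partner:
mount -t lustre /dev/shared/mds /mnt/mds    # 1.6-style server start
&lt;/pre&gt;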
=== How necessary is failover, really?===&lt;br /&gt;
&lt;br /&gt;
The answer depends on how close to 100% uptime you need to achieve. Failover doesn&#039;t protect against the failure of individual disks -- that is handled by software or hardware RAID. Failover is to handle the failure of an MDS or OSS node as a whole which, in our experience, is not very common.&lt;br /&gt;
&lt;br /&gt;
We would suggest that simple RAID-5 is sufficient for most users, but that the most important production systems should consider failover.&lt;br /&gt;
=== I don&#039;t need failover, and don&#039;t want shared storage. How will this work?===&lt;br /&gt;
&lt;br /&gt;
If Lustre is configured without shared storage for failover, and a server node fails, then a client that tries to use that node will pause until the failed server is returned to operation. After a short delay (a configurable timeout value), applications waiting for those nodes can be aborted with a signal (kill or Ctrl-C), similar to the NFS soft-mount mode.&lt;br /&gt;
&lt;br /&gt;
When the node is returned to service, applications which have not been aborted will continue to run without errors or data loss.&lt;br /&gt;
=== If a node suffers a connection failure, will the node select an alternate route for recovery?===&lt;br /&gt;
&lt;br /&gt;
Yes. If a node has multiple network paths, and one fails, it can continue to use the others.&lt;br /&gt;
=== What are the supported hardware methods for HBA, switch, and controller failover?===&lt;br /&gt;
&lt;br /&gt;
These are supported to the extent that the HBA drivers support them. If arrays with multiple ports are shared by multiple I/O nodes, Lustre offers 100% transparent failover for I/O and metadata nodes. (Applications see a delay, but system calls complete without errors.)&lt;br /&gt;
=== Can you describe an example failure scenario, and its resolution?===&lt;br /&gt;
&lt;br /&gt;
Although failures are becoming more rare, it is more likely that a node will hang or timeout rather than crash. If a client node hangs or crashes, usually all other client and server nodes are not affected. Normally such a client is rebooted and rejoins the file system. When server nodes hang, they are commonly restarted, merely causing a short delay to applications which try to use that node. Other server nodes or clients are not usually affected.&lt;br /&gt;
=== How are power failures, disk or RAID controller failures, etc. addressed?===&lt;br /&gt;
&lt;br /&gt;
If I/O to the storage is interrupted AND the storage device guarantees strict ordering of transactions, then the ext3 journal recovery will restore the file system in a few seconds.&lt;br /&gt;
&lt;br /&gt;
If the file system is damaged through device failures, unordered transactions, or a power loss affecting a storage device&#039;s caches, Lustre requires a file system repair. Lustre&#039;s tools will reliably repair any damage they can. The repair runs in parallel on all nodes, but can still be very time consuming for large file systems.&lt;br /&gt;
&lt;br /&gt;
== OS Support ==&lt;br /&gt;
&lt;br /&gt;
=== Which operating systems are/will be supported?===&lt;br /&gt;
&lt;br /&gt;
There are three ways for a file system client to access a Lustre volume: directly, with a native kernel driver; directly, via a userspace library (liblustre); or indirectly, via an NFS or CIFS export.&lt;br /&gt;
&lt;br /&gt;
Today, native kernel drivers exist only for Linux (2.4 and 2.6). Other native ports, such as to Windows, AIX, or Solaris, are being considered. Your feedback about the desirability of such ports would be appreciated.&lt;br /&gt;
&lt;br /&gt;
liblustre is not yet as robust or well-tested as Lustre in the Linux kernel, but it is fundamentally almost entirely the same code. The Cray Red Storm supercomputer (2005) will run liblustre on the Catamount operating system to provide all client and server storage, including root FS and swap space -- so very serious use of liblustre is right around the corner. It could in principle be used on any Unix-like operating system, or even Windows.&lt;br /&gt;
&lt;br /&gt;
CIFS or NFS export will be the solution of choice for most of our customers wishing to integrate non-Linux platforms in the short term.&lt;br /&gt;
&lt;br /&gt;
We currently support only Linux as Lustre server nodes. Support for other operating systems such as Solaris will be available in late 2008.&lt;br /&gt;
&lt;br /&gt;
Beginning with Lustre 1.6.0, Linux 2.4 servers will no longer be supported, and with 1.8 clients with 2.4 kernels will not be supported.&lt;br /&gt;
&lt;br /&gt;
=== Why have you decided to patch the Linux kernel?===&lt;br /&gt;
&lt;br /&gt;
Lustre&#039;s goals are extremely ambitious; there are few, if any, other systems which attempt such scalability, performance, and consistency guarantees in a single package. Because this hadn&#039;t been done before, the infrastructure was not present in the kernel for such a file system.&lt;br /&gt;
&lt;br /&gt;
The majority of the changes were in the VFS layer, implementing an API extension to make intent locking possible. Another very substantial set of changes was made to ext3, to make it more scalable and performant. Some extra symbols also need to be exported.&lt;br /&gt;
&lt;br /&gt;
=== Are there plans to get these patches into the kernel.org/OSDL kernel?===&lt;br /&gt;
&lt;br /&gt;
Yes. The Lustre patches have been extensively reviewed by the Linux kernel community and vendors. Many of the changes are already present in Linux 2.6, and our contributions to ext3 have been delivered upstream (although many are not yet incorporated).&lt;br /&gt;
&lt;br /&gt;
We have made several changes based on the feedback from these reviews, and Lustre 1.8 will be based on this new set of patches. It is not possible to switch to these patches until we no longer support Linux 2.4, because many of these changes are not practically possible in Linux 2.4.&lt;br /&gt;
&lt;br /&gt;
Based on the reaction from the kernel community principals, and the customers putting pressure on vendors to ship Lustre, we expect that these patches will be incorporated into the kernel.org tree before the release of Lustre 1.8. We will continue to make reasonable changes which may be necessary to conform to the Linux kernel.&lt;br /&gt;
=== Can I run Lustre without patching my kernel?===&lt;br /&gt;
&lt;br /&gt;
Certainly -- you can run SuSE Linux Enterprise Server.&lt;br /&gt;
&lt;br /&gt;
SLES 9 includes a kernel that supports Lustre out of the box, and SLES 9 service pack 2 supports the Lustre 1.4.x series. Our close partnership with Novell ensures that there is always a vendor-supported kernel for the latest version of Lustre.&lt;br /&gt;
&lt;br /&gt;
=== Which Linux kernels are supported?===&lt;br /&gt;
&lt;br /&gt;
We currently support the following kernels: for Linux 2.4, the Red Hat Enterprise Linux 3 kernel (based on 2.4.21); for Linux 2.6, the SuSE Linux Enterprise Server 9 kernel (based on 2.6.5), the SuSE Linux Enterprise Server 10 kernel (based on 2.6.16), and the Red Hat Enterprise Linux 4 kernel (based on 2.6.9). Enterprise support customers can download complete binary packages for many architectures.&lt;br /&gt;
=== Which Linux distributions are supported?===&lt;br /&gt;
&lt;br /&gt;
Because Lustre runs almost entirely in the kernel, there are practically no distribution-specific issues.&lt;br /&gt;
&lt;br /&gt;
Lustre enterprise support customers can download packages which have been tested on RHEL3, RHEL4, SLES9 and SLES10 systems, but are likely to work elsewhere. You are welcome to use whichever distribution you wish, although we do ask that you use one of the Lustre-supported kernels.&lt;br /&gt;
&lt;br /&gt;
=== What if I don&#039;t run one of those kernels?===&lt;br /&gt;
&lt;br /&gt;
If you can&#039;t or won&#039;t run one of the supported kernels, then there is not much that we can do for you. It&#039;s not a matter of &amp;quot;just building from source&amp;quot;, because you need patches for your particular kernel, and given the differences between most kernels, these are fairly non-trivial to port. Until there is customer demand for a given kernel series, the Lustre group simply does not have the resources to maintain those patches.&lt;br /&gt;
&lt;br /&gt;
We expect that this will no longer be an issue in the timeframe of Lustre 1.8, and in the meantime we will support the kernels that our customers use.&lt;br /&gt;
&lt;br /&gt;
=== Do you support Lustre on an SGI Altix?===&lt;br /&gt;
&lt;br /&gt;
We expect that, if you use the SLES9 kernel, it should pretty much work out of the box.&lt;br /&gt;
&lt;br /&gt;
=== When was Lustre for Linux first used in production?===&lt;br /&gt;
&lt;br /&gt;
A pre-1.0 version of Lustre for Linux was first used in a production cluster environment in March 2003.&lt;br /&gt;
&lt;br /&gt;
== Release Testing and Upgrading ==&lt;br /&gt;
=== How does the Lustre group fix issues? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre group approaches bug tracking and fixing seriously and methodically:&lt;br /&gt;
&lt;br /&gt;
* Regression testing: A test is written to reproduce the problem, which is added to the ongoing test suite.&lt;br /&gt;
* Architecture and Design: Depending on the severity and invasiveness, an update to the architecture description may be written and reviewed by senior management. A detailed design description for the patch is written and reviewed by principal engineers.&lt;br /&gt;
* Implementation: Fixes are implemented according to the design description, and added to a bug for review and inspection.&lt;br /&gt;
* Review and Inspection: A developer or development team will review the code first, and then submit it for a methodical inspection by senior and principal engineers.&lt;br /&gt;
* Testing: The developer runs a small suite of tests before the code leaves his or her desk, then it&#039;s added to a branch for regression testing. Ongoing test results are available at https://buffalo.lustre.org/&lt;br /&gt;
&lt;br /&gt;
=== What testing does each version undergo prior to release? ===&lt;br /&gt;
&lt;br /&gt;
Sun and its vendor and customer partners run a large suite of tests on a number of systems, architectures, kernels, and interconnects, including clusters as large as 400 nodes. Major updates receive testing on the largest clusters available to us, around 1,000 nodes.&lt;br /&gt;
===Are Lustre releases backwards and forward compatible on the disk? On the wire?===&lt;br /&gt;
&lt;br /&gt;
Special care is taken to ensure that any disk format changes -- which are rare to begin with -- are handled transparently in previous and subsequent releases. Before the disk format changes, we release versions which support the new format, so you can safely roll back in case of problems. After the format change, new versions continue to support the old formats for some time, and transparently update disk structures when old versions are encountered.&lt;br /&gt;
&lt;br /&gt;
So far, the same care has not been taken for wire structures, because we&#039;ve been adding features so quickly that it hasn&#039;t been practical.&lt;br /&gt;
&lt;br /&gt;
Some people ask &amp;quot;why don&#039;t you just have separate handlers for different versions of the protocol?&amp;quot; but it&#039;s unfortunately not that simple. The protocol rarely changes for aesthetic reasons, but rather because some part of the underlying infrastructure has been changed or extended in a substantive way. It is often not possible to trivially map the old behaviour to the new behaviour, or to do so in a way that preserves proper semantics.&lt;br /&gt;
&lt;br /&gt;
There will come a time when we make that effort, but there is a finite pool of resources, so we will rely on our customers to tell us when that work is more important than adding features or fixing bugs. Beginning with Lustre 1.4.0, we will make clear in our release notes when the wire protocol changes, and with which versions it is backwards-compatible, if any.&lt;br /&gt;
&lt;br /&gt;
=== Do you have to reboot to upgrade? ===&lt;br /&gt;
&lt;br /&gt;
Not unless you upgrade your kernel. It&#039;s usually a simple matter of unmounting the file system or stopping the server, as the case may be, installing the new RPMs, and restarting it.&lt;br /&gt;
&lt;br /&gt;
Some of our customers upgrade servers between wire-compatible releases using failover; a service is failed over, the software is updated on the stopped node, the service is failed back, and the failover partner is upgraded in the same way.&lt;br /&gt;
&lt;br /&gt;
== Licensing and Support ==&lt;br /&gt;
&lt;br /&gt;
=== What is the licensing model for the Lustre file system for Linux? ===&lt;br /&gt;
&lt;br /&gt;
The Lustre file system for Linux is an Open Source product.&lt;br /&gt;
&lt;br /&gt;
New releases are made available to the general public at the same time as to our paying customers and partners, under the terms and conditions of the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
As we develop ports for operating systems other than Linux, it is highly likely that these will be proprietary. We may also decide to develop some future features as proprietary add-ons (outlined in red on the roadmap). These are easily separable, non-core features primarily of interest to enterprise customers.&lt;br /&gt;
&lt;br /&gt;
Virtually every Open Source company has gone out of business or had to change to a proprietary model; this is our way of balancing the realities of business with the desire to provide an excellent Open Source cluster file system for Linux.&lt;br /&gt;
&lt;br /&gt;
=== How can you add proprietary features? If you release part of Lustre under the GPL, you have to distribute the code for the entire thing. ===&lt;br /&gt;
&lt;br /&gt;
Sun owns the copyright to virtually 100% of the relevant Lustre source code, and has a very liberal license to the remainder. This means that we can distribute Lustre under whichever license (or licenses) we choose.&lt;br /&gt;
&lt;br /&gt;
=== If public money helped fund the development of Lustre, don&#039;t the taxpayers own that code?===&lt;br /&gt;
&lt;br /&gt;
Some portions of Lustre were developed under the sponsorship of the US Government under Subcontract nos. B514193, B525177, B523817, B536384, 2204-10713, and others. Under the terms of those subcontracts, Sun retained the copyright to all software developed. We released all of this software to the US Government under the GNU GPL.&lt;br /&gt;
&lt;br /&gt;
Many of these efforts were to produce prototypes or beta implementations, not production-quality software. Sun has invested considerable resources in productizing these features, and developing new features entirely on our own.&lt;br /&gt;
&lt;br /&gt;
If Sun goes out of business, the taxpayers&#039; money will have been wasted. To that end, our government sponsors strongly encourage us to build a sustainable business model.&lt;br /&gt;
&lt;br /&gt;
=== How does the commercial distribution of Lustre work?===&lt;br /&gt;
&lt;br /&gt;
Our Lustre enterprise support customers receive unlimited access to the Lustre technical support team, and through them, the Lustre developers. Support contracts are priced according to the number of Lustre clients and servers, with discounts that grow as the cluster size increases.&lt;br /&gt;
&lt;br /&gt;
For more information about Lustre support contracts, please contact sales@clusterfs.com.&lt;br /&gt;
&lt;br /&gt;
=== Will Sun develop custom features for a proprietary product?===&lt;br /&gt;
&lt;br /&gt;
You can ask, but generally speaking, no. There are several potential reasons for this:&lt;br /&gt;
&lt;br /&gt;
Ideological: We want to provide a fully-featured cluster file system for Linux which is available to everyone. If some features were developed for your proprietary product, it undermines that effort.&lt;br /&gt;
&lt;br /&gt;
Pragmatic: We don&#039;t have the resources to test and support very many different versions. It&#039;s best for us if everyone runs software with the same features, because it very substantially reduces the release engineering burden.&lt;br /&gt;
&lt;br /&gt;
Selfish: It would be unfortunate for us if the code that we developed for you, under a proprietary license, was code that we later wanted to use in a different way. We&#039;d rather not paint ourselves into that corner.&lt;br /&gt;
&lt;br /&gt;
===Which Lustre support services are available?===&lt;br /&gt;
&lt;br /&gt;
Sun provides worldwide, 24/7 enterprise support services for the Lustre file system, with guaranteed response times as low as one hour. You can contact us via telephone, email, or our web site, and speak directly to the experts who designed and implemented the file system.&lt;br /&gt;
&lt;br /&gt;
Sun also supports many partners and their products with licensing and support services. For partners with demonstrated expertise and Lustre support experience, substantial discounts are available.&lt;br /&gt;
&lt;br /&gt;
Contract development of new features, or acceleration of roadmap features to a guaranteed delivery date, is also possible.&lt;br /&gt;
&lt;br /&gt;
Sun also provides public and private on-site training services for your system administrators or support staff.&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Change_Log_1.6&amp;diff=4217</id>
		<title>Change Log 1.6</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Change_Log_1.6&amp;diff=4217"/>
		<updated>2008-01-28T04:01:38Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* [http://wiki.lustre.org/index.php?title=Change_Log_1.4 change log 1.4] */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Changes from v1.6.4.1 to v1.6.4.2=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to problems with nested symlinks and FMODE_EXEC (bug 12652), we do not recommend using patchless RHEL4 clients with kernels prior to 2.6.9-55EL (RHEL4U5).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.4-cfs1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RHEL 4 (patched) and RHEL 5/SLES 10 (patchless) clients behave differently on &#039;cd&#039; to a removed cwd &amp;quot;./&amp;quot; (refer to Bugzilla 14399).&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: only for relatively new filesystems, when OSTs are in recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14631  14631 ]&lt;br /&gt;
&lt;br /&gt;
Description: OST objects below id 20000 are deleted, causing data loss &lt;br /&gt;
&lt;br /&gt;
Details: For relatively newly formatted OST file systems, where fewer than 20000 objects have been created on an OST, a bug in MDS-&amp;gt;OST orphan recovery could cause those objects to be deleted if the OST was in recovery but the MDS was not. Safety checks in the orphan recovery prevent this if more than 20000 objects were ever created on an OST. If the MDS was also in recovery, the problem was not hit. Only in 1.6.4.1. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare, depends on device drivers and load&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14529 14529] &lt;br /&gt;
&lt;br /&gt;
Description: MDS or OSS nodes crash due to stack overflow &lt;br /&gt;
&lt;br /&gt;
Details: Code changes in 1.6.4 increased the stack usage of some functions. In some cases, in conjunction with device drivers that use a lot of stack, the MDS (or possibly OSS) service threads could overflow the stack. One change which was identified to consume additional stack has been reworked to avoid the extra stack usage. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.4 to v1.6.4.1=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14433 14433] &lt;br /&gt;
&lt;br /&gt;
Description: Oops on connection from 1.6.3 client &lt;br /&gt;
&lt;br /&gt;
Frequency: always, on connection from 1.6.3 client &lt;br /&gt;
&lt;br /&gt;
Details: Enable and accept the OBD_CONNECT_LRU_RESIZE flag only if LRU resizing is enabled at configure time. This fixes an oops caused by incorrectly accepting the LRU_RESIZE feature even if --enable-lru-resize is not specified. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.3 to v1.6.4=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11686 11686] &lt;br /&gt;
&lt;br /&gt;
Description: Console message flood &lt;br /&gt;
&lt;br /&gt;
Details: Make cdls ratelimiting more tunable by adding several tunables in procfs: /proc/sys/lnet/console_{min,max}_delay_centisecs and /proc/sys/lnet/console_backoff. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13521 13521] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patches for SLES10 2.6.16.53-0.8. &lt;br /&gt;
&lt;br /&gt;
Details: Update which_patch &amp;amp; target file for SLES10 latest kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13128 13128] &lt;br /&gt;
&lt;br /&gt;
Description: add --type and --size parameters to lfs find &lt;br /&gt;
&lt;br /&gt;
Details: Enhance lfs find by adding filetype and filesize parameters. Also multiple OBDs can now be specified for the --obd option. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11270 11270] &lt;br /&gt;
&lt;br /&gt;
Description: eliminate client locks in face of contention &lt;br /&gt;
&lt;br /&gt;
Details: file contention detection and lockless i/o implementation for contended files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12411 12411] &lt;br /&gt;
&lt;br /&gt;
Description: Remove client patches from SLES 10 kernel. &lt;br /&gt;
&lt;br /&gt;
Details: This causes SLES 10 clients to behave as patchless clients even on a Lustre-patched (server) kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=2369 2369 ]&lt;br /&gt;
&lt;br /&gt;
Description: use i_size_read and i_size_write in 2.6 port &lt;br /&gt;
&lt;br /&gt;
Details: replace inode-&amp;gt;i_size access with i_size_read/write() &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13454 13454] &lt;br /&gt;
&lt;br /&gt;
Description: Add jbd statistics patch for RHEL5 and 2.6.18-vanilla. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13518 13518] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL4 2.6.9-55.0.6. &lt;br /&gt;
&lt;br /&gt;
Details: Modify vm-tunables-rhel4.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13452 13452] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel config for 2.6.18-vanilla. &lt;br /&gt;
&lt;br /&gt;
Details: Modify targets/2.6-vanilla.target.in. Add config file kernel-2.6.18-2.6-vanilla-i686.config. Add config file kernel-2.6.18-2.6-vanilla-i686-smp.config. Add config file kernel-2.6.18-2.6-vanilla-x86_64.config. Add config file kernel-2.6.18-2.6-vanilla-x86_64-smp.config. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13207 13207] &lt;br /&gt;
&lt;br /&gt;
Description: adapt the lustre_config script to support the upgrade case &lt;br /&gt;
&lt;br /&gt;
Details: Add &amp;quot;-u&amp;quot; option for lustre_config script to support upgrading 1.4 server targets to 1.6 in parallel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13751 13751] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL5 2.6.18-8.1.14.el5. &lt;br /&gt;
&lt;br /&gt;
Details: Modify target file &amp;amp; which_patch. A flaw was found in the IA32 system call emulation provided on AMD64 and Intel 64 platforms. An improperly validated 64-bit value could be stored in the %RAX register, which could trigger an out-of-bounds system call table access. An untrusted local user could exploit this flaw to run code in the kernel (i.e., a root privilege escalation). (CVE-2007-4573). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13748 13748] &lt;br /&gt;
&lt;br /&gt;
Description: Update RHEL 4 kernel to fix local root privilege escalation. &lt;br /&gt;
&lt;br /&gt;
Details: Update to the latest RHEL 4 kernel to fix the vulnerability described in CVE-2007-4573. This problem could allow untrusted local users to gain root access. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14353 14353] &lt;br /&gt;
&lt;br /&gt;
Description: excessive CPU consumption on client reduces IO performance &lt;br /&gt;
&lt;br /&gt;
Details: In some cases the ldlm_poold thread spends too much time trying to cancel locks and cancels them too aggressively, which can severely impact I/O performance. The dynamic LRU resize code is now disabled by default; it can be re-enabled at build time with configure --enable-lru-resize. &lt;br /&gt;
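&lt;br /&gt;
A build-time sketch for sites that want the old behaviour back (other configure options omitted): &lt;br /&gt;
 ./configure --enable-lru-resize
 make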
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13917 13917] &lt;br /&gt;
&lt;br /&gt;
Description: MDS hang or stay in waiting lock &lt;br /&gt;
&lt;br /&gt;
Details: If a client receives a lock with the CBPENDING flag, the LDLM needs to send the lock cancel as a separate RPC, to avoid the situation where the cancel request cannot be processed because all I/O threads are waiting on the lock. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11710 11710] &lt;br /&gt;
&lt;br /&gt;
Description: improve handling of recoverable errors &lt;br /&gt;
&lt;br /&gt;
Details: If a request fails with an error that may be recoverable on the server, the request should be resent; otherwise the page is released from the cache and marked as an error. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12302 12302] &lt;br /&gt;
&lt;br /&gt;
Description: new userspace socklnd &lt;br /&gt;
&lt;br /&gt;
Details: The old userspace tcpnal that resided in lnet/ulnds/socklnd has been replaced with a new one, usocklnd. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13730 13730] &lt;br /&gt;
&lt;br /&gt;
Description: Do not fail import if osc_interpret_create gets -EAGAIN &lt;br /&gt;
&lt;br /&gt;
Details: If osc_interpret_create gets -EAGAIN, it immediately exits and wakes up oscc_waitq. After the wakeup, oscc_wait_for_objects calls oscc_has_objects, sees that the OSC has no objects, and calls oscc_internal_create to resend the create request. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when removing large files&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13181 13181] &lt;br /&gt;
&lt;br /&gt;
Description: scheduling issue during removal of large Lustre files &lt;br /&gt;
&lt;br /&gt;
Details: Don&#039;t take the BKL in fsfilt_ext3_setattr() for 2.6 kernels. It causes scheduling issues when removing large files (17TB in the present case). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13358 13358] &lt;br /&gt;
&lt;br /&gt;
Description: 1.4.11 Can&#039;t handle directories with stripe set and extended ACLs &lt;br /&gt;
&lt;br /&gt;
Details: It is impossible (EPROTO is returned) to access a directory that has non-default striping and ACLs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only on ppc&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12234 12234] &lt;br /&gt;
&lt;br /&gt;
Description: /proc/fs/lustre/devices broken on ppc &lt;br /&gt;
&lt;br /&gt;
Details: The patch as applied to 1.6.2 doesn&#039;t look correct for all arches. We should make sure the type of &#039;index&#039; is loff_t and then cast explicitly as needed below. Do not assign an explicitly cast loff_t to an int. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for rhel5&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13616 13616] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL5 2.6.18-8.1.10.el5. &lt;br /&gt;
&lt;br /&gt;
Details: Modify the target file &amp;amp; which_kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: if the uninit_groups feature is enabled on ldiskfs&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13706 13706] &lt;br /&gt;
&lt;br /&gt;
Description: e2fsck reports &amp;quot;invalid unused inodes count&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: If a new ldiskfs filesystem is created with the &amp;quot;uninit_groups&amp;quot; feature and only a single inode is created in a group then the &amp;quot;bg_unused_inodes&amp;quot; count is incorrectly updated. Creating a second inode in that group would update it correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only if filesystem is inconsistent&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11673 11673] &lt;br /&gt;
&lt;br /&gt;
Description: handle &amp;quot;serious error: objid * already exists&amp;quot; more gracefully &lt;br /&gt;
&lt;br /&gt;
Details: If LAST_ID value on disk is smaller than the objects existing in the O/0/d* directories, it indicates disk corruption and causes an LBUG(). If the object is 0-length, then we should use the existing object. This will help to avoid a full fsck in most cases. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rarely&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13570 13570] &lt;br /&gt;
&lt;br /&gt;
Description: Avoid grant space &amp;gt; available space when the disk is almost full. Without this patch you might see the error &amp;quot;grant XXXX &amp;gt; available&amp;quot; or an LBUG about grants when the disk is almost full. &lt;br /&gt;
&lt;br /&gt;
Details: In filter_check_grant, for a non-grant cache write, we should check the remaining space with (*left &amp;gt; ungranted + bytes) instead of (*left &amp;gt; ungranted), because the ungranted space should only be increased once we are sure the remaining space is enough for another &amp;quot;bytes&amp;quot;. On the client, cl_avail_grant should be updated only if OBD_MD_FLGRANT is present in the reply. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when using O_DIRECT and quotas&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13930 13930] &lt;br /&gt;
&lt;br /&gt;
Description: Incorrect file ownership on O_DIRECT output files &lt;br /&gt;
&lt;br /&gt;
Details: block usage reported by &#039;lfs quota&#039; does not take into account files that have been written with O_DIRECT. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13976 13976] &lt;br /&gt;
&lt;br /&gt;
Description: touch file failed when fs is not full &lt;br /&gt;
&lt;br /&gt;
Details: An OST in recovery should not be discarded by the MDS in alloc_qos(); otherwise we can get ENOSPC while the filesystem is not full. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13805 13805] &lt;br /&gt;
&lt;br /&gt;
Description: data checksumming impacts single node performance &lt;br /&gt;
&lt;br /&gt;
Details: Disable checksums by default, since they impact single-node performance. Checksums can still be enabled by default via &amp;quot;configure --enable-checksum&amp;quot;, or at runtime via procfs. &lt;br /&gt;
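&lt;br /&gt;
A sketch of both ways to turn checksums back on; the procfs path shown is an assumption about the per-OSC layout and may differ: &lt;br /&gt;
 ./configure --enable-checksum        # build-time default
 echo 1 &amp;gt; /proc/fs/lustre/osc/&amp;lt;osc_device&amp;gt;/checksums   # runtime, per OSC (path assumed)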
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: when lov objid is destroyed&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14222 14222] &lt;br /&gt;
&lt;br /&gt;
Description: mds can&#039;t recreate lov objid file. &lt;br /&gt;
&lt;br /&gt;
Details: If the lov objid file is destroyed and the OST with the highest index connects first, the MDS does not get the last objid number from the OST. Also, if the MDS does get the last id from an OST, it does not tell the OSC about it, which produces a warning about a wrong delete-orphan request. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rarely&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12948 12948] &lt;br /&gt;
&lt;br /&gt;
Description: buffer overruns could theoretically occur &lt;br /&gt;
&lt;br /&gt;
Details: llapi_semantic_traverse() modifies the &amp;quot;path&amp;quot; argument by appending values to the end of the original string, and a buffer overrun may occur. Add a buffer overrun check in liblustreapi. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13732 13732] &lt;br /&gt;
&lt;br /&gt;
Description: change order of libsysio includes &lt;br /&gt;
&lt;br /&gt;
Details: &#039;#include sysio.h&#039; should always come before &#039;#include xtio.h&#039; &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.2 to v1.6.3=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.2.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18-8.1.8.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12192 12192] &lt;br /&gt;
&lt;br /&gt;
Description: llapi_file_create() does not allow some changes &lt;br /&gt;
&lt;br /&gt;
Details: add llapi_file_open() that allows specifying the file creation mode and open flags, and also returns an open file handle. &lt;br /&gt;
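&lt;br /&gt;
A hedged usage sketch in C; the argument order below is inferred from the description and should be verified against the liblustreapi header: &lt;br /&gt;
 #include &amp;lt;fcntl.h&amp;gt;
 #include &amp;lt;lustre/liblustreapi.h&amp;gt;   /* header name/location may vary by release */
 /* create a file with explicit open flags/mode and striping, returning an open fd */
 int fd = llapi_file_open(&amp;quot;/mnt/lustre/out.dat&amp;quot;, O_CREAT | O_WRONLY, 0644,
                          1048576 /* stripe_size */, -1 /* stripe_offset */,
                          4 /* stripe_count */, 0 /* stripe_pattern */);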
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12743 12743] &lt;br /&gt;
&lt;br /&gt;
Description: df doesn&#039;t work properly if diskfs blocksize != 4K &lt;br /&gt;
&lt;br /&gt;
Details: Choose the biggest blocksize among the OSTs as the LOV&#039;s blocksize. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11248 11248] &lt;br /&gt;
&lt;br /&gt;
Description: merge and cleanup kernel patches. &lt;br /&gt;
&lt;br /&gt;
Details: Remove mnt_lustre_list in vfs_intent-2.6-rhel4.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13039 13039] &lt;br /&gt;
&lt;br /&gt;
Description: RedHat Update kernel for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Kernel config file for RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12446 12446] &lt;br /&gt;
&lt;br /&gt;
Description: OSS needs multiple precreate threads &lt;br /&gt;
&lt;br /&gt;
Details: Add ability to start more than one create thread per OSS. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13039 13039] &lt;br /&gt;
&lt;br /&gt;
Description: RedHat Update kernel for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Modify the kernel config file to more closely match RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13360 13360] &lt;br /&gt;
&lt;br /&gt;
Description: Build failure against Centos5 (RHEL5) &lt;br /&gt;
&lt;br /&gt;
Details: Define PAGE_SIZE when it isn&#039;t present. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11401 11401] &lt;br /&gt;
&lt;br /&gt;
Description: client-side metadata stat-ahead during readdir(directory readahead) &lt;br /&gt;
&lt;br /&gt;
Details: perform client-side metadata stat-ahead when the client detects readdir and sequential stat of dir entries therein &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11230 11230] &lt;br /&gt;
&lt;br /&gt;
Description: Tune the kernel for good SCSI performance. &lt;br /&gt;
&lt;br /&gt;
Details: Set the value of /sys/block/{dev}/queue/max_sectors_kb to the value of /sys/block/{dev}/queue/max_hw_sectors_kb in mount_lustre. &lt;br /&gt;
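&lt;br /&gt;
mount_lustre now applies this automatically; the equivalent manual tuning for a single device (sdb is an example name): &lt;br /&gt;
 cat /sys/block/sdb/queue/max_hw_sectors_kb &amp;gt; /sys/block/sdb/queue/max_sectors_kb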
&lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: Always for filesystems larger than 2TB on 32-bit systems.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13547 13547] , [https://bugzilla.lustre.org/show_bug.cgi?id=13627 13627] &lt;br /&gt;
&lt;br /&gt;
Description: Data corruption for OSTs that are formatted larger than 2TB on 32-bit servers. &lt;br /&gt;
&lt;br /&gt;
Details: When generating the bio request for Lustre file writes, the sector number would overflow a temporary variable before being used for the I/O. The data reads back correctly from Lustre (which overflows in a similar manner), but other file data or filesystem metadata may be corrupted in some cases. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13236 13236] &lt;br /&gt;
&lt;br /&gt;
Description: TOE Kernel panic by ksocklnd &lt;br /&gt;
&lt;br /&gt;
Details: Offloaded sockets provide their own implementation of sendpage; tcp_sendpage() cannot be called directly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13482 13482] &lt;br /&gt;
&lt;br /&gt;
Description: build error &lt;br /&gt;
&lt;br /&gt;
Details: fix typos in gmlnd, ptllnd and viblnd &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12932 12932] &lt;br /&gt;
&lt;br /&gt;
Description: obd_health_check_timeout too short &lt;br /&gt;
&lt;br /&gt;
Details: Set obd_health_check_timeout to 1.5x obd_timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only with quota on the root user&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12223 12223] &lt;br /&gt;
&lt;br /&gt;
Description: mds_obd_create error creating tmp object &lt;br /&gt;
&lt;br /&gt;
Details: When the user sets a quota on root, llog is affected and cannot create or write files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12782 12782] &lt;br /&gt;
&lt;br /&gt;
Description: /proc/sys/lnet has non-sysctl entries &lt;br /&gt;
&lt;br /&gt;
Details: Updating dump_kernel/daemon_file/debug_mb to use sysctl variables &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10778 10778] &lt;br /&gt;
&lt;br /&gt;
Description: kibnal_shutdown() doesn&#039;t finish; lconf --cleanup hangs &lt;br /&gt;
&lt;br /&gt;
Details: races between lnd_shutdown and peer creation prevent lnd_shutdown from finishing. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13279 13279] &lt;br /&gt;
&lt;br /&gt;
Description: open files rlimit 1024 reached while liblustre testing &lt;br /&gt;
&lt;br /&gt;
Details: ulnds/socklnd must close open socket after unsuccessful &#039;say hello&#039; attempt. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always on directories with default striping set&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12836 12836] &lt;br /&gt;
&lt;br /&gt;
Description: lfs find on -1 stripe looping in lsm_lmm_verify_common() &lt;br /&gt;
&lt;br /&gt;
Details: Avoid lov_verify_lmm_common() on directory with -1 stripe count. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Always on ia64 patchless client, and possibly others.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12826 12826] &lt;br /&gt;
&lt;br /&gt;
Description: Add EXPORT_SYMBOL check for node_to_cpumask symbol. &lt;br /&gt;
&lt;br /&gt;
Details: This allows the patchless client to be loaded on architectures without this export. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13142 13142] &lt;br /&gt;
&lt;br /&gt;
Description: disorder of journal start and llog_add cause deadlock. &lt;br /&gt;
&lt;br /&gt;
Details: In llog_origin_connect, journal start should happen before llog_add, keeping the same order as other functions, to avoid the deadlock. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: occasionally when using NFS&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13030 13030] &lt;br /&gt;
&lt;br /&gt;
Description: &amp;quot;ll_intent_file_open()) lock enqueue: err: -13&amp;quot; with nfs &lt;br /&gt;
&lt;br /&gt;
Details: With NFS, the anonymous dentry&#039;s parent is set to itself in d_alloc_anon(), so on the MDS we use rec-&amp;gt;ur_fid1 to find the corresponding dentry rather than rec-&amp;gt;ur_name. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Occasionally with failover&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12459 12459] &lt;br /&gt;
&lt;br /&gt;
Description: Client eviction due to failover config &lt;br /&gt;
&lt;br /&gt;
Details: after a connection loss, the lustre client should attempt to reconnect to the last active server first before trying the other potential connections. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with liblustre clients on XT3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12418 12418] &lt;br /&gt;
&lt;br /&gt;
Description: evictions taking too long &lt;br /&gt;
&lt;br /&gt;
Details: allow llrd to evict clients directly on OSTs &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13125 13125] &lt;br /&gt;
&lt;br /&gt;
Description: osts not allocated evenly to files &lt;br /&gt;
&lt;br /&gt;
Details: change the condition to increase offset_idx &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13436 13436] &lt;br /&gt;
&lt;br /&gt;
Description: Only disconnect errors should be returned by rq_status. &lt;br /&gt;
&lt;br /&gt;
Details: In the open/enqueue process, some errors that cause the client to be disconnected should be returned by rq_status, while other errors should still be returned by intent so that mdc or llite can detect them. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13600 13600] &lt;br /&gt;
&lt;br /&gt;
Description: &amp;quot;lfs find -obd UUID&amp;quot; prints directories &lt;br /&gt;
&lt;br /&gt;
Details: &amp;quot;lfs find -obd UUID&amp;quot; will return all directory names instead of just file names. It is incorrect because the directories do not reside on the OSTs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13596 13596] &lt;br /&gt;
&lt;br /&gt;
Description: MDS hang after unclean shutdown of lots of clients &lt;br /&gt;
&lt;br /&gt;
Details: Never resend AST requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Always, for kernels after 2.6.16&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13304 13304] &lt;br /&gt;
&lt;br /&gt;
Description: Fix warning &amp;quot;idr_remove called for id=.. which is not allocated&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
Details: Recent kernels save the old s_dev before killing the superblock and do not allow restoring it from a callback; restore it before calling kill_anon_super. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12186 12186] &lt;br /&gt;
&lt;br /&gt;
Description: Fix errors in lfs documentation &lt;br /&gt;
&lt;br /&gt;
Details: Fixes man pages &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12588 12588] &lt;br /&gt;
&lt;br /&gt;
Description: When the MDS and OSTs use different quota units (32-bit and 64-bit), quota is released repeatedly. &lt;br /&gt;
&lt;br /&gt;
Details: Avoid sending multiple quota requests to the MDS by keeping the status between the requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: cleanup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13532 13532] &lt;br /&gt;
&lt;br /&gt;
Description: rewrite ext2-derived code in llite/dir.c and obdclass/uuid.c &lt;br /&gt;
&lt;br /&gt;
Details: rewrite inherited code (uuid parsing code from ext2 utils and readdir code from ext3) from scratch preserving functionality. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.1 to v1.6.2=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.2.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18-8.1.8.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs8&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12786 12786] &lt;br /&gt;
&lt;br /&gt;
Description: lfs setstripe enhancement &lt;br /&gt;
&lt;br /&gt;
Details: Make lfs setstripe understand &#039;k&#039;, &#039;m&#039; and &#039;g&#039; for stripe size. &lt;br /&gt;
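&lt;br /&gt;
A hypothetical example, assuming the -s/--size option from the optional-parameter syntax (path is a placeholder): &lt;br /&gt;
 lfs setstripe -s 2m -c 4 /mnt/lustre/dir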
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12211 12211] &lt;br /&gt;
&lt;br /&gt;
Description: random memory allocation failure utility &lt;br /&gt;
&lt;br /&gt;
Details: Make Lustre randomly fail memory allocations, for testing purposes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10786 10786] &lt;br /&gt;
&lt;br /&gt;
Description: omit setting fsid for NFS export &lt;br /&gt;
&lt;br /&gt;
Details: Fix setting/restoring the device id to avoid an EMFILE error, and mark the Lustre filesystem as FS_REQUIRES_DEV to avoid problems with generating the fsid. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10595 10595] &lt;br /&gt;
&lt;br /&gt;
Description: Error message improvement. &lt;br /&gt;
&lt;br /&gt;
Details: Merging of two LCONSOLE_ERROR_MSG into one. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12602 12606] &lt;br /&gt;
&lt;br /&gt;
Description: don&#039;t use GFP_* in generic Lustre code. &lt;br /&gt;
&lt;br /&gt;
Details: Use cfs_alloc_* functions and CFS_* flags for code portability. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12333 12333] &lt;br /&gt;
&lt;br /&gt;
Description: obdclass is limited by single OBD_ALLOC(idarray) &lt;br /&gt;
&lt;br /&gt;
Details: replace OBD_ALLOC/OBD_FREE with OBD_VMALLOC/OBD_VFREE &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415 ]&lt;br /&gt;
&lt;br /&gt;
Description: updated patches for the new RHEL4 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Fixed ext3-unlink-race.patch per Kalpak&#039;s comment. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13006 13006 ]&lt;br /&gt;
&lt;br /&gt;
Description: warnings when building the patchless client with vanilla 2.6.19 and up &lt;br /&gt;
&lt;br /&gt;
Details: Change the old ctl_table style and replace ctl_table/ctl_table_header with cfs_sysctl_table_t/cfs_sysctl_table_header_t. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13093 13093] &lt;br /&gt;
&lt;br /&gt;
Description: O_DIRECT bypasses client statistics. &lt;br /&gt;
&lt;br /&gt;
Details: When running with O_DIRECT I/O, neither the client rpc_stats nor read_ahead_stats were updated. Copied stats section from osc_send_oap_rpc() into async_internal(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13249 13249] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches for SLES9 2.6.5-7.286 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Update target/ChangeLog/which_patch . &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12955 12955] &lt;br /&gt;
&lt;br /&gt;
Description: jbd statistics &lt;br /&gt;
&lt;br /&gt;
Details: Port older jbd statistics patch for sles10 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13360 13360] &lt;br /&gt;
&lt;br /&gt;
Description: Build failure against Centos5 (RHEL5) &lt;br /&gt;
&lt;br /&gt;
Details: Use getpagesize() instead of PAGE_SIZE. &lt;br /&gt;
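&lt;br /&gt;
A minimal userspace sketch of the substitution: &lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;
 /* PAGE_SIZE is not exported to userspace headers on all distributions (e.g. CentOS 5);
  * query the page size at runtime instead */
 size_t pagesize = (size_t)getpagesize();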
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: after network failures&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12769 12769] &lt;br /&gt;
&lt;br /&gt;
Description: Add sync option to mount_lustre.c &lt;br /&gt;
&lt;br /&gt;
Details: Client loses data written to lustre after a network interruption. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: mds/oss recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10800 10800] &lt;br /&gt;
&lt;br /&gt;
Description: llog ctxt is referenced after it has been freed. &lt;br /&gt;
&lt;br /&gt;
Details: An llog ctxt refcount was added to avoid the race between ctxt free and the llog recovery process. Each llog user must hold a ctxt refcount before accessing the llog, and the llog ctxt can only be freed when its refcount is zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for SLES10&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12771 12771] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patch for SLES10 SP1 &lt;br /&gt;
&lt;br /&gt;
Details: Add patch blkdev_tunables-2.6-sles10.patch to 2.6-sles10.series. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11802 11802] &lt;br /&gt;
&lt;br /&gt;
Description: lustre support for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Add support for RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11756 11756] &lt;br /&gt;
&lt;br /&gt;
Description: umount blocks forever on error &lt;br /&gt;
&lt;br /&gt;
Details: As a result of incorrect use of the obd_no_recov and obd_force flags, the client can hang if a cancel or some other request is lost. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Only for SLES&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13177 13177] &lt;br /&gt;
&lt;br /&gt;
Description: sanity_quota fail test_1 &lt;br /&gt;
&lt;br /&gt;
Details: There are multiple occurrences of $TSTUSR in SLES&#039;s /etc/group file, which makes TSTID[2] non-unique. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9977 9977] &lt;br /&gt;
&lt;br /&gt;
Description: lvbo_init failed for resource with missing objects. &lt;br /&gt;
&lt;br /&gt;
Details: Fix returning an error when we stat a file with missing/corrupted objects; i_size is set to the sum of the sizes of all available objects. If we truncate or write to a missing object, it is recreated. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: When flocks are used.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13103 13103] &lt;br /&gt;
&lt;br /&gt;
Description: assertion failure in ldlm_cli_enqueue_fini for a non-NULL lock. &lt;br /&gt;
&lt;br /&gt;
Details: Flock locks might destroy a just-granted lock if it can be merged with another existing flock; this is done in the completion handler, so teach ldlm_cli_enqueue_fini that this is a valid case for flock locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11974 11974] &lt;br /&gt;
&lt;br /&gt;
Description: reply_lock_interpret crash due to race with it and lock cancel. &lt;br /&gt;
&lt;br /&gt;
Details: Do not replay locks that are being cancelled. Do not reference locks by their address during replay, just by their handle. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with deactivated OSTs&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11679 11679] &lt;br /&gt;
&lt;br /&gt;
Description: lstripe command fails for valid OST index &lt;br /&gt;
&lt;br /&gt;
Details: The stripe offset is compared to &#039;lov-&amp;gt;desc.ld_tgt_count&#039; instead of lov-&amp;gt;desc.ld_active_tgt_count. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13147 13147] &lt;br /&gt;
&lt;br /&gt;
Description: block reactivating mgc import until all deactivates complete &lt;br /&gt;
&lt;br /&gt;
Details: Fix race when failing back MDT/MGS to itself (testing) &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for Cray XT3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11706 11706] &lt;br /&gt;
&lt;br /&gt;
Description: peer credits not enough on many OST per OSS systems. &lt;br /&gt;
&lt;br /&gt;
Details: Use the new LNET way to add credits, as we need those for pings and ASTs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with liblustre&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12790 12790] &lt;br /&gt;
&lt;br /&gt;
Description: Liblustre is not releasing flock locks on file close. &lt;br /&gt;
&lt;br /&gt;
Details: Release flock locks on file close. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for RHEL4&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12839 12839] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patches for kernel-2.6.9-55.0.2.EL &lt;br /&gt;
&lt;br /&gt;
Details: Remove inode-nr_unused-2.6.9-rhel4.patch from 2.6-rhel4.series; update the target file and kernel config. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11327 11327 ]&lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(export != NULL) failed in target_handle_connect &lt;br /&gt;
&lt;br /&gt;
Details: The assertion is hit as the result of a rare race between a disconnect and a connect to the same NID: target_handle_connect found an old connection cookie and tried to reconnect, but could not find the export for this cookie. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13276 13276] &lt;br /&gt;
&lt;br /&gt;
Description: Oops in read and write path when failing to allocate lock. &lt;br /&gt;
&lt;br /&gt;
Details: Check if lock allocation failed and return error back. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0.1 to v1.6.1=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.283 (SLES 9), 2.6.9-55.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs8&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel. &lt;br /&gt;
Starting with this release, the ldiskfs backing filesystem required by Lustre is now in its own package, lustre-ldiskfs. This package should be installed. It is versioned separately from Lustre and may be released separately in future.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12194 12194] &lt;br /&gt;
&lt;br /&gt;
Description: add optional extra BUILD_VERSION info &lt;br /&gt;
&lt;br /&gt;
Details: Add a new environment variable (LUSTRE_VERS) which allows overriding the Lustre version. &lt;br /&gt;
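&lt;br /&gt;
A hypothetical usage sketch (the version string and build target are examples only): &lt;br /&gt;
 LUSTRE_VERS=1.6.1-custom make rpms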
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11548 11548] &lt;br /&gt;
&lt;br /&gt;
Description: Add LNET router traceability for debug purposes &lt;br /&gt;
&lt;br /&gt;
Details: If a checksum failure occurs with a router as part of the IO path, the NID of the last router that forwarded the bulk data is printed so it can be identified. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10997 10997] &lt;br /&gt;
&lt;br /&gt;
Description: lfs setstripe uses optional parameters instead of positional parameters. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10651 10651] &lt;br /&gt;
&lt;br /&gt;
Description: Nanosecond timestamp support for ldiskfs &lt;br /&gt;
&lt;br /&gt;
Details: The on-disk ldiskfs filesystem has added support for nanosecond resolution timestamps. There is not yet support for this at the Lustre filesystem level. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10768 10768] &lt;br /&gt;
&lt;br /&gt;
Description: 64-bit inode version &lt;br /&gt;
&lt;br /&gt;
Details: Add an on-disk 64-bit inode version for ext3 to track changes made to the inode. This will be required for version-based recovery. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11563 11563] &lt;br /&gt;
&lt;br /&gt;
Description: Add -o localflock option to simulate old noflock behaviour. &lt;br /&gt;
&lt;br /&gt;
Details: This achieves local-only flock/fcntl lock coherency. &lt;br /&gt;
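&lt;br /&gt;
A client mount sketch (the MGS NID, filesystem name and mount point are placeholders): &lt;br /&gt;
 mount -t lustre -o localflock mgsnode@tcp0:/lustre /mnt/lustre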
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11647 11647] &lt;br /&gt;
&lt;br /&gt;
Description: update patchless client &lt;br /&gt;
&lt;br /&gt;
Details: Add support for patchless client with 2.6.20, 2.6.21 and RHEL 5 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10589 10589] &lt;br /&gt;
&lt;br /&gt;
Description: metadata RPC reduction (e.g. for rm performance) &lt;br /&gt;
&lt;br /&gt;
Details: Decrease the number of synchronous RPCs between clients and servers by cancelling conflicting locks before the operation on the client and packing their handles into the main operation RPC to the server. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12605 12605] &lt;br /&gt;
&lt;br /&gt;
Description: add #ifdef HAVE_KERNEL_CONFIG_H &lt;br /&gt;
&lt;br /&gt;
Details: Kernels from 2.6.19 on do not need to include linux/config.h, but instead add an include of linux/autoconf.h on the compiler command line. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12764 12764] &lt;br /&gt;
&lt;br /&gt;
Description: patchless client support for 2.6.22 kernel &lt;br /&gt;
&lt;br /&gt;
Details: 2.6.22 has only one visible change: the SLAB_CTOR_* constants are removed. In this case we need to stop using the OS-dependent interface to kmem_cache and use the cfs_mem_cache API. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10968 10968] &lt;br /&gt;
&lt;br /&gt;
Description: VFS operations stats tool. &lt;br /&gt;
&lt;br /&gt;
Details: A tool which collects stats by tracking the value written to pid, ppid or gid, and uses llstat to generate output for plotting a graph with plot-llstat. Updated lustre/utils/Makefile.am; added lustre/utils/ltrack_stats.c. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11039 11039] &lt;br /&gt;
&lt;br /&gt;
Description: 2.6.18 server support (lustre 1.6.1) &lt;br /&gt;
&lt;br /&gt;
Details: Support for 2.6.18 kernels on the server side. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12678 12678] &lt;br /&gt;
&lt;br /&gt;
Description: remove fs_prep_san_write operation and related patches &lt;br /&gt;
&lt;br /&gt;
Details: remove the ext3-san-jdike patches which are no longer useful. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4900 4900] &lt;br /&gt;
&lt;br /&gt;
Description: Async OSC create to avoid the blocking unnecessarily. &lt;br /&gt;
&lt;br /&gt;
Details: If an OST has no remaining objects, the system blocks on object creation when it needs to create a new object on that OST. Now, always use pre-created objects when available instead of blocking on an empty OSC while others are not empty. If we must block, we block for the shortest possible period of time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11721 11721] &lt;br /&gt;
&lt;br /&gt;
Description: Add printing inode info into message about error in writepage. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11971 11971] &lt;br /&gt;
&lt;br /&gt;
Description: Accessing a block device can re-enable I/O when Lustre is tearing down a device. &lt;br /&gt;
&lt;br /&gt;
Details: dev_clear_rdonly(bdev) must be called in kill_bdev() instead of blkdev_put(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only with mballoc3 code and deep extent trees&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12861 12861] &lt;br /&gt;
&lt;br /&gt;
Description: ldiskfs_ext_search_right: bad header in inode: unexpected eh_depth &lt;br /&gt;
&lt;br /&gt;
Details: a wrong check of extent headers in ldiskfs_ext_search_right() can cause the filesystem to be remounted read-only. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13129 13129] &lt;br /&gt;
&lt;br /&gt;
Description: server LBUG when shutting down &lt;br /&gt;
&lt;br /&gt;
Details: Block umount forever until the mount refcount is zero rather than giving up after an arbitrary timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: 2.6.18 servers only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12546 12546 ]&lt;br /&gt;
&lt;br /&gt;
Description: ll_kern_mount() doesn&#039;t release the module reference &lt;br /&gt;
&lt;br /&gt;
Details: The ldiskfs module reference count never drops down to 0 because ll_kern_mount() doesn&#039;t release the module reference. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12470 12470] &lt;br /&gt;
&lt;br /&gt;
Description: server LBUG when using old ost_num_threads parameter &lt;br /&gt;
&lt;br /&gt;
Details: Accept the old ost_num_threads parameter but warn that it is deprecated, and fix an off-by-one error that caused an LBUG. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11722 11722] &lt;br /&gt;
&lt;br /&gt;
Description: Transient SCSI error results in persistent IO issue &lt;br /&gt;
&lt;br /&gt;
Details: iobuf-&amp;gt;dr_error is not reinitialized to 0 between two uses. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: sometimes when underlying device returns I/O errors&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11743 11743] &lt;br /&gt;
&lt;br /&gt;
Description: OSTs not going read-only during write failures &lt;br /&gt;
&lt;br /&gt;
Details: OSTs are not remounted read-only when the journal commit threads get I/O errors because fsfilt_ext3 calls journal_start/stop() instead of the ext3 wrappers. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: SLES10 only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12538 12538] &lt;br /&gt;
&lt;br /&gt;
Description: sanity-quota.sh quotacheck failed: rc = -22 &lt;br /&gt;
&lt;br /&gt;
Details: Quotas cannot be enabled on SLES10. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: liblustre clients only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12229 12229] &lt;br /&gt;
&lt;br /&gt;
Description: getdirentries does not give error when run on compute nodes &lt;br /&gt;
&lt;br /&gt;
Details: getdirentries does not fail when the size specified as an argument is too small to contain at least one entry &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11315 11315] &lt;br /&gt;
&lt;br /&gt;
Description: OST &amp;quot;spontaneously&amp;quot; evicts client; client has imp_pingable == 0 &lt;br /&gt;
&lt;br /&gt;
Details: Due to a race condition, liblustre clients were occasionally evicted incorrectly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: during server recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11203 11203] &lt;br /&gt;
&lt;br /&gt;
Description: MDS failing to send precreate requests due to OSCC_FLAG_RECOVERING &lt;br /&gt;
&lt;br /&gt;
Details: Requests with the rq_no_resend flag do not wake up l_wait_event if they get a timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11818 11818] &lt;br /&gt;
&lt;br /&gt;
Description: MDS fails to start if a duplicate client export is detected &lt;br /&gt;
&lt;br /&gt;
Details: in some rare cases it was possible for a client to connect to an MDS multiple times. Upon recovery the MDS would detect this and fail during startup. Handle this more gracefully. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12477 12477] &lt;br /&gt;
&lt;br /&gt;
Description: Wrong request locking in request set processing &lt;br /&gt;
&lt;br /&gt;
Details: ptlrpc_check_set wrongly uses req-&amp;gt;rq_lock to protect adding to imp_delayed_list; imp_lock should be used here instead. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when reconnecting&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Grant leak when OSC reconnect to OST &lt;br /&gt;
&lt;br /&gt;
Details: When the OSC reconnects to the OST, the OST (filter) should check whether it should grant more space to the client by comparing fed_grant and cl_avail_grant, and return the granted space to the client instead of the &amp;quot;newly granted&amp;quot; space, because the client will call osc_init_grant to update the client&#039;s grant space info. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when client reconnects to OST&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Grant leak when OSC does a resend and replays bulk write &lt;br /&gt;
&lt;br /&gt;
Details: When the OSC reconnects to the OST, the OST (filter) should clear the grant info of bulk write requests, because the grant info will be synced between OSC and OST on reconnect, and we should ignore the grant info of resent/replayed write requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Grant space can sometimes exceed available space. &lt;br /&gt;
&lt;br /&gt;
Details: When the OST is about to become full and two bulk writes from different clients arrive at the OST, then according to the available space of the OST the first request should be permitted and the second one denied with ENOSPC. But if the second arrives before the first one is committed, the OST might wrongly permit the second write, which causes grant space &amp;gt; available space. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when client is evicted&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12371 12371] &lt;br /&gt;
&lt;br /&gt;
Description: Grant might be wrongly erased when osc is evicted by OST &lt;br /&gt;
&lt;br /&gt;
Details: When the import is evicted by the server, it forks another thread, ptlrpc_invalidate_import_thread, to invalidate the import, where the grant is set to 0, while the original thread updates the grant it got when connecting. If the former happens later, the grant is wrongly erased because of this race. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12401 12401] &lt;br /&gt;
&lt;br /&gt;
Description: Checking Stale with correct fid &lt;br /&gt;
&lt;br /&gt;
Details: ll_revalidate_it should use de_inode instead of op_data.fid2 to check whether it is stale, because sometimes we want the enqueue to happen anyway and op_data.fid2 will not be initialized. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with 2.4 kernel&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12134 12134] &lt;br /&gt;
&lt;br /&gt;
Description: random memory corruption &lt;br /&gt;
&lt;br /&gt;
Details: The size of struct ll_inode_info is too big for union inode.u, and this can be the cause of random memory corruption. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10818 10818] &lt;br /&gt;
&lt;br /&gt;
Description: Memory leak in recovery &lt;br /&gt;
&lt;br /&gt;
Details: lov_mds_md was not freed in an error handler in mds_create_object. It should also check obd_fail before fsfilt_start; otherwise, if fsfilt_start returns -EROFS (failover of the MDS during MDS recovery), the request returns with repmsg-&amp;gt;transno = 0 and rc = EROFS, and we hit the assertion LASSERT(req-&amp;gt;rq_reqmsg-&amp;gt;transno == req-&amp;gt;rq_repmsg-&amp;gt;transno) in ptlrpc_replay_interpret. Fcc should be freed regardless of whether fsfilt_commit succeeds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11935 11935] &lt;br /&gt;
&lt;br /&gt;
Description: Not check open intent error before release open handle &lt;br /&gt;
&lt;br /&gt;
Details: In some rare cases, the open intent error is not checked before releasing the open handle, which may cause ASSERTION(open_req-&amp;gt;rq_transno != 0) because it tries to release the failed open handle. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12556 12556] &lt;br /&gt;
&lt;br /&gt;
Description: Set cat log bitmap only after create log success. &lt;br /&gt;
&lt;br /&gt;
Details: In some rare cases, the catalog log bitmap is set too early; it should be set only after the log is created successfully. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12086 12086] &lt;br /&gt;
&lt;br /&gt;
Description: the cat log was not initialized in recovery &lt;br /&gt;
&lt;br /&gt;
Details: When the MDS (MGS) does recovery, tgt_count might be zero, so the unlink log on the MDS will not be initialized until MDS post-recovery. Also, in MDS post-recovery the unlink log initialization is done asynchronously, so there is a race between adding to the unlink log and the unlink log initialization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12579 12597] &lt;br /&gt;
&lt;br /&gt;
Description: brw_stats were being printed incorrectly &lt;br /&gt;
&lt;br /&gt;
Details: brw_stats were being printed as log2, but not all of them were recorded as log2. Also remove some code duplication arising from filter_tally_{read,write}. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare, only in recovery.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11674 11674] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(req-&amp;gt;rq_type != LI_POISON) failed &lt;br /&gt;
&lt;br /&gt;
Details: imp_lock should be held while iterating over imp_sending_list to prevent destroying a request after it gets a timeout in ptlrpc_queue_wait. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12689 12689] &lt;br /&gt;
&lt;br /&gt;
Description: replay-single.sh test 52 fails &lt;br /&gt;
&lt;br /&gt;
Details: A lock&#039;s skiplist needs to be cleaned up when it is unlinked from its resource list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11737 11737] &lt;br /&gt;
&lt;br /&gt;
Description: Short direct I/O read returns the full requested size rather than the actual amount read. &lt;br /&gt;
&lt;br /&gt;
Details: Direct I/O operations should return actual amount of bytes transferred rather than requested size. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12646 12646] &lt;br /&gt;
&lt;br /&gt;
Description: sanity.sh test_77h fails with &amp;quot;test_77h file compare failed&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: test_77h uses a file which had been messed up by another test case. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12576 12576] &lt;br /&gt;
&lt;br /&gt;
Description: lov_tgts is not checked for NULL in some lov functions &lt;br /&gt;
&lt;br /&gt;
Details: Checking whether lov_tgts is NULL in some functions. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11815 11815] &lt;br /&gt;
&lt;br /&gt;
Description: replace obdo_alloc() with OBDO_ALLOC macro &lt;br /&gt;
&lt;br /&gt;
Details: nothing special is done in the obdo_alloc() function; for debugging purposes it needs to be replaced with the OBDO_ALLOC macro. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12784 12784] &lt;br /&gt;
&lt;br /&gt;
Description: bad return value and errno from fcntl call &lt;br /&gt;
&lt;br /&gt;
Details: In the liblustre API, errno should be set to a negative value when an error occurs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11544 11544] &lt;br /&gt;
&lt;br /&gt;
Description: ptlrpc_check_set() LBUG &lt;br /&gt;
&lt;br /&gt;
Details: In the case of a positive reply from the server but a failed client bulk callback after the bulk transfer, the client shouldn&#039;t LBUG; it should instead process the request as erroneous. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12696 12696] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(imp-&amp;gt;imp_conn_current) failed &lt;br /&gt;
&lt;br /&gt;
Details: an assertion failure is hit if a client node boots and attempts to mount a Lustre filesystem within RECONNECT_INTERVAL seconds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for i686&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12695 12695] &lt;br /&gt;
&lt;br /&gt;
Description: 1.4.11 RC1 build fails for RHEL 4, i686 &lt;br /&gt;
&lt;br /&gt;
Details: Fixed config variable for build. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415] &lt;br /&gt;
&lt;br /&gt;
Description: Updated patches for the new RHEL4 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Updated patches inode-nr_unused-2.6.9-rhel4.patch, jbd-stats-2.6.9.patch, qsnet-rhel4-2.6.patch, quota-deadlock-on-pagelock-core.patch, vfs_intent-2.6-rhel4.patch and vfs_races-2.6-rhel4.patch; updated series files 2.6-rhel4-titech.series and 2.6-rhel4.series; updated kernel config files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12374 12374] &lt;br /&gt;
&lt;br /&gt;
Description: lquota slave hits an LBUG when reconnecting to the MDS or during MDS failover. &lt;br /&gt;
&lt;br /&gt;
Details: the quota slave depends on qctxt-&amp;gt;lqc_import to send its quota requests. This pointer becomes invalid if the MDS fails over or breaks its connection to the OSTs, which leads to an LBUG. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when the qunit size is too small (less than 20M)&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12588  12588] &lt;br /&gt;
&lt;br /&gt;
Description: write is stopped by improper -EDQUOT &lt;br /&gt;
&lt;br /&gt;
Details: If the master is busy and the qunit size is small enough (say 1M), the slave cannot get quota from the master in time, which leads the slave to return -EDQUOT to the client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12629 12629] &lt;br /&gt;
&lt;br /&gt;
Description: Deadlock during metadata tests &lt;br /&gt;
&lt;br /&gt;
Details: in prune_dir_dentries(), shrink_dcache_parent() should not be called with the per-dentry lock held. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: SLES9 only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12744 12744] &lt;br /&gt;
&lt;br /&gt;
Description: Lustre patched kernel for SLES9 SP3 has NR_CPUS set to 8 &lt;br /&gt;
&lt;br /&gt;
Details: set CONFIG_NR_CPUS to 128 instead of 8. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11324 11324] &lt;br /&gt;
&lt;br /&gt;
Description: LDISKFS-fs error (device sdc): ldiskfs_free_blocks &lt;br /&gt;
&lt;br /&gt;
Details: a disk corruption can cause the mballoc code to assert on a double free or other extent corruptions. Handle these with ext3_error() instead of assertions. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13063 13063] &lt;br /&gt;
&lt;br /&gt;
Description: lfsck built against 1.4.x cannot run against 1.6.0 lustre &lt;br /&gt;
&lt;br /&gt;
Details: the definition for OBD_IOC_GETNAME changed in 1.6.0. One of the few external users of this ioctl number is lfsck&#039;s call to llapi_lov_get_uuids() and this caused lfsck to fail at startup. Add the old ioctl number to the handler so both old and new lfsck can work. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11301 11301] &lt;br /&gt;
&lt;br /&gt;
Description: parallel lock callbacks &lt;br /&gt;
&lt;br /&gt;
Details: Instead of sending blocking and completion callbacks as separate requests, add them to a set and send them in parallel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12417 12417] &lt;br /&gt;
&lt;br /&gt;
Description: Disable most debugging by default &lt;br /&gt;
&lt;br /&gt;
Details: To improve performance, disable most logging (for debug purposes) by default. VFSTRACE, RPCTRACE, and DLMTRACE are now off by default, and HA includes fewer messages. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11013 11013] &lt;br /&gt;
&lt;br /&gt;
Description: hash tables for lists of nids, connections and uuids &lt;br /&gt;
&lt;br /&gt;
Details: Hash tables noticeably help when a lot of clients connect to a server: duplicate connections and reconnects are identified faster, and the export to evict is found faster in the manual eviction case. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11190 11190] &lt;br /&gt;
&lt;br /&gt;
Description: Sometimes, when the server evicts a client, the client is not actually evicted as soon as possible. &lt;br /&gt;
&lt;br /&gt;
Details: In the enqueue request, the error was returned via the intent instead of rq_status, so the ptlrpc layer does not detect the error and does not evict the client. The enqueue error should therefore be returned in rq_status. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for SLES9&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12543 12543] &lt;br /&gt;
&lt;br /&gt;
Description: Routinely utilize latest Quadrics drivers in CFS releases &lt;br /&gt;
&lt;br /&gt;
Details: Update patch qsnet-suse-2.6.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for sles10&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12771 12771] &lt;br /&gt;
&lt;br /&gt;
Description: Update patches for SLES 10 SP1 kernel. &lt;br /&gt;
&lt;br /&gt;
Details: Update the patch vfs_intent-2.6-sles10.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12543 12543] &lt;br /&gt;
&lt;br /&gt;
Description: Routinely utilize latest Quadrics drivers in CFS releases &lt;br /&gt;
&lt;br /&gt;
Details: Update patch qsnet-rhel4-2.6.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12975 12975] &lt;br /&gt;
&lt;br /&gt;
Description: Using wrong pointer in osc_brw_prep_request &lt;br /&gt;
&lt;br /&gt;
Details: Access to array[-1] can produce a panic if the kernel is compiled with CONFIG_PAGE_ALLOC enabled. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only in recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13148 13148] &lt;br /&gt;
&lt;br /&gt;
Description: Mark an OST as early accessible if its startup SYNC is in progress. &lt;br /&gt;
&lt;br /&gt;
Details: osc_precreate returns the early-accessible flag if the oscc is marked OSCC_FLAG_SYNC_IN_PROGRESS. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13196 13196] &lt;br /&gt;
&lt;br /&gt;
Description: Sometimes the precreate code can trigger object creation on the wrong OST &lt;br /&gt;
&lt;br /&gt;
Details: Incorrectly protected or unrestored variables after the precreate loop can cause object creation on the wrong OST. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: oss recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10800 10800] &lt;br /&gt;
&lt;br /&gt;
Description: llog_commit_thread cleanup should sync with llog_commit_thread start &lt;br /&gt;
&lt;br /&gt;
Details: llog_commit_thread_count should be kept in sync between llog_commit start and cleanup, so that a new llog_commit thread is not started while llog_commit threads are being stopped, to avoid accessing freed data. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with 10000 clients or more&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12364 12364] &lt;br /&gt;
&lt;br /&gt;
Description: poor connect scaling with increasing client count &lt;br /&gt;
&lt;br /&gt;
Details: Don&#039;t run filter_grant_sanity_check for more than 100 exports to improve scaling for large numbers of clients. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: nfs export on patchless client&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11970 11970] &lt;br /&gt;
&lt;br /&gt;
Description: connectathon hang when test nfs export over patchless client &lt;br /&gt;
&lt;br /&gt;
Details: Disconnected dentry cannot be found with lookup, so we do not need to unhash it or make it invalid &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11546 11546] &lt;br /&gt;
&lt;br /&gt;
Description: open req refcounting wrong on reconnect &lt;br /&gt;
&lt;br /&gt;
Details: If a reconnect happened between getting the open reply from the server and the call to mdc_set_replay_data in ll_file_open, we would schedule replay for an unreferenced request that we are about to free. A subsequent close would then crash in a variety of ways. Check that the request is still eligible for replay in mdc_set_replay_data(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11512 11512] &lt;br /&gt;
&lt;br /&gt;
Description: disable writes to filesystem when reading health_check file &lt;br /&gt;
&lt;br /&gt;
Details: the default for reading the health_check proc file has changed to NOT do a journal transaction and write to disk, because this can cause reads of the /proc file to hang and block HA state checking on a healthy but otherwise heavily loaded system. It is possible to return to the previous behaviour during configure with --enable-health-write. &lt;br /&gt;
&lt;br /&gt;
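As a rough illustration only (the proc path assumes the standard /proc/fs/lustre layout and is not taken from the bug report), the health file can still be polled with a plain read, and the old write-to-disk behaviour can be restored by rebuilding with the configure switch mentioned above: &lt;br /&gt;
&lt;br /&gt;
cat /proc/fs/lustre/health_check &lt;br /&gt;
./configure --enable-health-write &lt;br /&gt;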
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11658 11658 ]&lt;br /&gt;
&lt;br /&gt;
Description: log_commit_thread vs filter_destroy race leads to crash &lt;br /&gt;
&lt;br /&gt;
Details: Take import reference before releasing llog record semaphore &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with huge numbers of clients&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11817 11817] &lt;br /&gt;
&lt;br /&gt;
Description: Prevent taking the superblock lock in llap_from_page for a page that is about to die. &lt;br /&gt;
&lt;br /&gt;
Details: using the LL_ORIGIN_REMOVEPAGE origin flag instead of LL_ORIGIN_UNKNOW for the llap_from_page call in ll_removepage() avoids taking the superblock lock for a page that is about to die. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11706 11706] &lt;br /&gt;
&lt;br /&gt;
Description: service threads may hog cpus when there are a lot of requests &lt;br /&gt;
&lt;br /&gt;
Details: Insert cond_resched to give other threads a chance to use some CPU &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12747 12747] &lt;br /&gt;
&lt;br /&gt;
Description: fix malformed messages &lt;br /&gt;
&lt;br /&gt;
Details: fix some malformed DEBUG_REQ and LCONSOLE_ERROR_MSG messages &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always in liblustre&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11737 11737] &lt;br /&gt;
&lt;br /&gt;
Description: wrong IS_ERR implementation in liblustre.h &lt;br /&gt;
&lt;br /&gt;
Details: fix the IS_ERR implementation in liblustre.h to correctly detect errors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10419 10419] &lt;br /&gt;
&lt;br /&gt;
Description: Correct condition for output debug message. &lt;br /&gt;
&lt;br /&gt;
Details: an inode i_nlink of zero is not enough to output a message about disk corruption; i_ctime and i_mode should also be checked. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always in patchless client&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415] &lt;br /&gt;
&lt;br /&gt;
Description: add configure check for truncate_complete_page &lt;br /&gt;
&lt;br /&gt;
Details: improve checks for exported symbols. This allows running the checks without kernel sources, using only the Module.symvers shipped with the kernel distribution. Add a check for truncate_complete_page, which is used by the patchless client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only run on patchless client.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12858 12858] &lt;br /&gt;
&lt;br /&gt;
Description: use do_facet in sanity.sh for tests handling recoverable errors &lt;br /&gt;
&lt;br /&gt;
Details: use do_facet instead of calling sysctl directly to set fail_loc on the OST &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only at startup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11778 11778] &lt;br /&gt;
&lt;br /&gt;
Description: Delay client connections to the MDT until the first MDT-&amp;gt;OST connect &lt;br /&gt;
&lt;br /&gt;
Details: If a client tried to create a new file before the MDT had connected to any OSTs, the create would return EIO. Now the client will simply block until the MDT connects to the first OST and the create can succeed. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: at startup only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12860 12860] &lt;br /&gt;
&lt;br /&gt;
Description: mds_lov_synchronize race leads to various problems &lt;br /&gt;
&lt;br /&gt;
Details: simultaneous MDT-&amp;gt;OST connections at startup can cause the sync to abort, leaving the OSC in a bad state. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0 to v1.6.0.1=&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: on some architectures&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12404 12404] &lt;br /&gt;
&lt;br /&gt;
Description: 1.6 client sometimes fails to mount from a 1.4 MDT &lt;br /&gt;
&lt;br /&gt;
Details: Uninitialized flags sometimes cause configuration commands to be skipped. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: patchless clients only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12391 12391] &lt;br /&gt;
&lt;br /&gt;
Description: missing __iget() symbol export &lt;br /&gt;
&lt;br /&gt;
Details: The __iget() symbol export is missing. To avoid the need for this on patchless clients the deathrow inode reaper is turned off, and we depend on the VM to clean up old inodes. This dependency was introduced via the fix for bug 12181. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12848 12848] &lt;br /&gt;
&lt;br /&gt;
Description: sanity.sh fail: test_52b &lt;br /&gt;
&lt;br /&gt;
Details: ll_inode_to_ext_flags() has a glitch which makes the MDS return incorrect inode flags to the client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.4.10 to v1.6.0=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CONFIGURATION CHANGE. This version of Lustre WILL NOT INTEROPERATE with older versions automatically. In many cases a special upgrade step is needed. Please read the user documentation before upgrading any part of a 1.4.x system.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;WARNING: Lustre configuration and startup changes are required with this release. See https://mail.clusterfs.com/wikis/lustre/MountConf for details.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.4.21-47.0.1.EL (RHEL 3), 2.6.5-7.283 (SLES 9), 2.6.9-42.0.10.EL (RHEL 4), 2.6.12.6 vanilla (kernel.org), 2.6.16.27-0.9 (SLES10)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see https://mail.clusterfs.com/wikis/lustre/PatchlessClient) 2.6.16 - 2.6.19 vanilla (kernel.org), 2.6.9-42.0.8EL (RHEL 4)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs6&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4900 4900] &lt;br /&gt;
&lt;br /&gt;
Description: Async OSC create to avoid blocking unnecessarily. &lt;br /&gt;
&lt;br /&gt;
Details: If an OST has no remaining precreated objects, the system would block when it needed to create a new object on that OST. Now pre-created objects are always used when available, instead of blocking on an empty OSC while others are not empty. If we must block, we block for the shortest possible period of time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=8007 8007] &lt;br /&gt;
&lt;br /&gt;
Description: MountConf &lt;br /&gt;
&lt;br /&gt;
Details: Lustre configuration is now managed via mkfs and mount commands instead of lmc and lconf. New obd types (MGS, MGC) are added for dynamic configuration management. See https://mail.clusterfs.com/wikis/lustre/MountConf for details. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4482 4482] &lt;br /&gt;
&lt;br /&gt;
Description: dynamic OST addition &lt;br /&gt;
&lt;br /&gt;
Details: OSTs can now be added to a live filesystem &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9851 9851] &lt;br /&gt;
&lt;br /&gt;
Description: startup order invariance &lt;br /&gt;
&lt;br /&gt;
Details: MDTs and OSTs can be started in any order. Clients only require the MDT to complete startup. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4899 4899] &lt;br /&gt;
&lt;br /&gt;
Description: parallel, asynchronous orphan cleanup &lt;br /&gt;
&lt;br /&gt;
Details: orphan cleanup is now performed in separate threads for each OST, allowing parallel non-blocking operation. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9862 9862] &lt;br /&gt;
&lt;br /&gt;
Description: optimized stripe assignment &lt;br /&gt;
&lt;br /&gt;
Details: stripe assignments are now made based on ost space available, ost previous usage, and OSS previous usage, in order to try to optimize storage space and networking resources. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4226 4226] &lt;br /&gt;
&lt;br /&gt;
Description: Permanently set tunables &lt;br /&gt;
&lt;br /&gt;
Details: All writable /proc/fs/lustre tunables can now be permanently set on a per-server basis, at mkfs time or on a live system. &lt;br /&gt;
&lt;br /&gt;
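As a hedged sketch of what this enables (the device, filesystem and parameter names below are illustrative, not taken from this entry), a tunable can be set permanently either at format time or later from the MGS node on a live system: &lt;br /&gt;
&lt;br /&gt;
mkfs.lustre --ost --mgsnode=mgs@tcp0 --param=&amp;quot;failover.mode=failout&amp;quot; /dev/sdb &lt;br /&gt;
lctl conf_param testfs-OST0000.osc.active=0 &lt;br /&gt;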
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10547 10547] &lt;br /&gt;
&lt;br /&gt;
Description: Lustre message v2 &lt;br /&gt;
&lt;br /&gt;
Details: Add lustre message format v2. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9866 9866] &lt;br /&gt;
&lt;br /&gt;
Description: client OST exclusion list &lt;br /&gt;
&lt;br /&gt;
Details: Clients can be started with a list of OSTs that should be declared &amp;quot;inactive&amp;quot; for known non-responsive OSTs. &lt;br /&gt;
&lt;br /&gt;
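A possible invocation (mount point, MGS NID and OST name are made up; the exclude mount option is the usual way such a list is passed, so check mount.lustre for the exact syntax): &lt;br /&gt;
&lt;br /&gt;
mount -t lustre -o exclude=testfs-OST0000 mgsnode@tcp0:/testfs /mnt/testfs &lt;br /&gt;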
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10088 10088] &lt;br /&gt;
&lt;br /&gt;
Description: fine-grained SMP locking inside DLM &lt;br /&gt;
&lt;br /&gt;
Details: Improve DLM performance on SMP systems by removing the single per-namespace lock and replacing it with per-resource locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9332 9332] &lt;br /&gt;
&lt;br /&gt;
Description: don&#039;t hold multiple extent locks at one time &lt;br /&gt;
&lt;br /&gt;
Details: To avoid client eviction during large writes, locks are not held on multiple stripes at one time or for very large writes. Otherwise, clients can block waiting for a lock on a failed OST while holding locks on other OSTs and be evicted. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9293 9293] &lt;br /&gt;
&lt;br /&gt;
Description: Multiple MD RPCs in flight. &lt;br /&gt;
&lt;br /&gt;
Details: Further unserialise some read-only MDT RPCs - learn about intents. To avoid overloading the MDT, introduce a limit on the number of MDT RPCs in flight for a single client and add /proc controls to adjust this limit. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22484 22484] &lt;br /&gt;
&lt;br /&gt;
Description: client read/write statistics &lt;br /&gt;
&lt;br /&gt;
Details: Add client read/write call usage stats for performance analysis of user processes. /proc/fs/lustre/llite/*/offset_stats shows non-sequential file access, extents_stats shows the chunk size distribution, and extents_stats_per_process shows the chunk size distribution per user process. &lt;br /&gt;
&lt;br /&gt;
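For example, the new files can simply be read on a client (paths as given above; the wildcard expands to the mounted client instance): &lt;br /&gt;
&lt;br /&gt;
cat /proc/fs/lustre/llite/*/offset_stats &lt;br /&gt;
cat /proc/fs/lustre/llite/*/extents_stats &lt;br /&gt;
cat /proc/fs/lustre/llite/*/extents_stats_per_process &lt;br /&gt;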
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22485 22485] &lt;br /&gt;
&lt;br /&gt;
Description: per-client statistics on server &lt;br /&gt;
&lt;br /&gt;
Details: Add ldlm and operations statistics for each client in /proc/fs/lustre/mds|obdfilter/*/exports/ &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22486 22486] &lt;br /&gt;
&lt;br /&gt;
Description: improved MDT statistics &lt;br /&gt;
&lt;br /&gt;
Details: Add detailed MDT operations statistics in /proc/fs/lustre/mds/*/stats &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10968 10968] &lt;br /&gt;
&lt;br /&gt;
Description: VFS operations stats &lt;br /&gt;
&lt;br /&gt;
Details: Add client VFS call stats, trackable by pid, ppid, or gid via /proc/fs/lustre/llite/*/stats_track_[pid|ppid|gid]. &lt;br /&gt;
&lt;br /&gt;
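A sketch of typical usage (the pid 1234 and the llite stats file name are assumptions; writing an id into the tunable follows the usual proc convention rather than anything stated in this entry): &lt;br /&gt;
&lt;br /&gt;
echo 1234 &amp;gt; /proc/fs/lustre/llite/*/stats_track_pid &lt;br /&gt;
cat /proc/fs/lustre/llite/*/stats &lt;br /&gt;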
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=2258 2258] &lt;br /&gt;
&lt;br /&gt;
Description: Dynamic service threads &lt;br /&gt;
&lt;br /&gt;
Details: Within a small range, start extra service threads automatically when the request queue builds up. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11229 11229] &lt;br /&gt;
&lt;br /&gt;
Description: Easy OST removal &lt;br /&gt;
&lt;br /&gt;
Details: OSTs can be permanently deactivated with e.g. &#039;lctl conf_param lustre-OST0001.osc.active=0&#039; &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11335 11335] &lt;br /&gt;
&lt;br /&gt;
Description: MGS proc entries &lt;br /&gt;
&lt;br /&gt;
Details: Added basic proc entries for the MGS showing what filesystems are served. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10998 10998] &lt;br /&gt;
&lt;br /&gt;
Description: provide MGS failover &lt;br /&gt;
&lt;br /&gt;
Details: Added config lock reacquisition after MGS server failover. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11461 11461] &lt;br /&gt;
&lt;br /&gt;
Description: add Linux 2.4 support &lt;br /&gt;
&lt;br /&gt;
Details: Added support for RHEL 2.4.21 kernel for 1.6 servers and clients &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10902 10902] &lt;br /&gt;
&lt;br /&gt;
Description: plain/inodebits lock performance improvement &lt;br /&gt;
&lt;br /&gt;
Details: Group plain/inodebits locks in the granted list by their request modes and bits policy, thus improving the performance of searches through the granted list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11667 11667] &lt;br /&gt;
&lt;br /&gt;
Description: Add &amp;quot;/proc/sys/lustre/debug_peer_on_timeout&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: liblustre environment variable: LIBLUSTRE_DEBUG_PEER_ON_TIMEOUT, a boolean to control whether to print peer debug info when a client&#039;s RPC times out. &lt;br /&gt;
&lt;br /&gt;
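For a liblustre application this would typically be set in the job&#039;s environment before launch, e.g. (assuming a value of 1 enables the boolean): &lt;br /&gt;
&lt;br /&gt;
export LIBLUSTRE_DEBUG_PEER_ON_TIMEOUT=1 &lt;br /&gt;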
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11264 11264] &lt;br /&gt;
&lt;br /&gt;
Description: Add uninit_groups feature to ldiskfs2 to speed up e2fsck &lt;br /&gt;
&lt;br /&gt;
Details: The uninit_groups feature works in conjunction with the kernel filesystem code (ldiskfs2 only) and e2fsprogs-1.39-cfs6 to speed up the pass1 processing of e2fsck. This is a read-only feature in ldiskfs2 only, so older kernels and current ldiskfs cannot mount filesystems that have had this feature enabled. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10816 10816] &lt;br /&gt;
&lt;br /&gt;
Description: Improve multi-block allocation algorithm to avoid fragmentation &lt;br /&gt;
&lt;br /&gt;
Details: The mballoc3 code (ldiskfs2 only) adds new mechanisms to improve allocation locality and avoid filesystem fragmentation. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: mixed-endian client/server environments&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11214 11214] &lt;br /&gt;
&lt;br /&gt;
Description: mixed-endian crashes &lt;br /&gt;
&lt;br /&gt;
Details: The new msg_v2 system had some failures in mixed-endian environments. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: when an incorrect nid is specified during startup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10734 10734] &lt;br /&gt;
&lt;br /&gt;
Description: ptlrpc connect to a non-existent node causes a kernel crash &lt;br /&gt;
&lt;br /&gt;
Details: LNET can&#039;t be re-entered from an event callback, which happened when we expired a message after the export had been cleaned up. Instead, hand the zombie cleanup off to another thread. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only if OST filesystem is corrupted&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9829 9829] &lt;br /&gt;
&lt;br /&gt;
Description: client incorrectly hits assertion in ptlrpc_replay_req() &lt;br /&gt;
&lt;br /&gt;
Details: for a short time RPCs with bulk IO are in the replay list, but replay of bulk IOs is unimplemented. If the OST filesystem is corrupted due to disk cache incoherency and then replay is started it is possible to trip an assertion. Avoid putting committed RPCs into the replay list at all to avoid this issue. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: liblustre (e.g. catamount) on a large cluster with &amp;gt;= 8 OSTs/OSS&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11684 11684] &lt;br /&gt;
&lt;br /&gt;
Description: System hang on startup &lt;br /&gt;
&lt;br /&gt;
Details: This bug allowed the liblustre (e.g. catamount) client to return to the app before handling all startup RPCs. This could leave the node unresponsive to lustre network traffic and manifested as a server ptllnd timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only for devices with external journals&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10719 10719] &lt;br /&gt;
&lt;br /&gt;
Description: Set external device read-only also &lt;br /&gt;
&lt;br /&gt;
Details: During a commanded failover stop, we set the disk device read-only while the server shuts down. We now also set any external journal device read-only at the same time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: when setting specific ost indices&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11149 11149] &lt;br /&gt;
&lt;br /&gt;
Description: QOS code breaks on skipped indices &lt;br /&gt;
&lt;br /&gt;
Details: Add checks for missing OST indices in the QOS code, so OSTs created with --index need not be sequential. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12123 12123] &lt;br /&gt;
&lt;br /&gt;
Description: ENOENT returned for valid filehandle during dbench. &lt;br /&gt;
&lt;br /&gt;
Details: Check if a directory has children when invalidating dentries associated with an inode during lock cancellation. This fixes an incorrect ENOENT sometimes seen for valid filehandles during testing with dbench. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11330 11330] &lt;br /&gt;
&lt;br /&gt;
Description: a large application tries to do I/O to the same resource and dies in the middle of it. &lt;br /&gt;
&lt;br /&gt;
Details: Check the req-&amp;gt;rq_arrival time after the call to ost_brw_lock_get(), but before we do anything about processing it &amp;amp; sending the BULK transfer request. This should help move old stale pending locks off the queue as quickly as obd_timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: SFS test only (otherwise harmless)&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=6062 6062] &lt;br /&gt;
&lt;br /&gt;
Description: SPEC SFS validation failure on NFS v2 over lustre. &lt;br /&gt;
&lt;br /&gt;
Details: Changes the blocksize for regular files to be 2x RPC size, and not depend on stripe size. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=6380 6380] &lt;br /&gt;
&lt;br /&gt;
Description: Fix client-side osc byte counters &lt;br /&gt;
&lt;br /&gt;
Details: The osc read/write byte counters in /proc/fs/lustre/osc/*/stats are now working &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always as root on SLES&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10667 10667] &lt;br /&gt;
&lt;br /&gt;
Description: Failure of copying files with lustre special EAs. &lt;br /&gt;
&lt;br /&gt;
Details: the client side now always returns success for setxattr calls for Lustre special xattrs (currently only &amp;quot;trusted.lov&amp;quot;). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10345 10345] &lt;br /&gt;
&lt;br /&gt;
Description: Refcount LNET uuids &lt;br /&gt;
&lt;br /&gt;
Details: The global LNET uuid list grew linearly with every startup; refcount repeated list entries instead of always adding to the list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for kernels with patches from Lustre below 1.4.3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11248 11248] &lt;br /&gt;
&lt;br /&gt;
Description: Remove old rdonly API &lt;br /&gt;
&lt;br /&gt;
Details: Remove the old rdonly API, which has been unused since at least Lustre 1.4.3 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: when upgrading from 1.4 while trying to change parameters&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11692 11692] &lt;br /&gt;
&lt;br /&gt;
Description: The wrong (new) MDC name was used when setting parameters for upgraded MDTs. Also allows changing of OSC (and MDC) parameters if --writeconf is specified at tunefs upgrade time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0 to v1.4.11=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12014 12014] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION failures when upgrading to the patchless zero-copy socklnd &lt;br /&gt;
&lt;br /&gt;
Details: This bug affects &amp;quot;rolling upgrades&amp;quot;, causing an inconsistent protocol version negotiation and subsequent assertion failure during rolling upgrades after the first wave of upgrades. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10916 10916] &lt;br /&gt;
&lt;br /&gt;
Description: added LNET self test &lt;br /&gt;
&lt;br /&gt;
Details: landing b_self_test &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12316 12316] &lt;br /&gt;
&lt;br /&gt;
Description: Add OFED1.2 support to o2iblnd &lt;br /&gt;
&lt;br /&gt;
Details: o2iblnd depends on OFED&#039;s modules; if out-of-tree OFED modules are installed (rather than the kernel&#039;s in-tree infiniband), there can be problems while loading o2iblnd (mismatched CRCs of ib_* symbols). If an extra Module.symvers is supported by the kernel (i.e. 2.6.17), this link provides a solution: https://bugs.openfabrics.org/show_bug.cgi?id=355. If an extra Module.symvers is not supported by the kernel, the script in bug 12316 must be run to update $LINUX/module.symvers before building o2iblnd. More details are in bug 12316. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11680 11680] &lt;br /&gt;
&lt;br /&gt;
Description: make panic on lbug configurable &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13288 13288] &lt;br /&gt;
&lt;br /&gt;
Description: Initialize cpumask before use &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11223 11223] &lt;br /&gt;
&lt;br /&gt;
Details: Change &amp;quot;dropped message&amp;quot; CERRORs to D_NETERROR so they are logged instead of creating &amp;quot;console chatter&amp;quot; when a lustre timeout races with normal RPC completion. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Details: lnet_clear_peer_table can wait forever if user forgets to clear a lazy portal. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Details: libcfs_id2str should check pid against LNET_PID_ANY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12227 12227] &lt;br /&gt;
&lt;br /&gt;
Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of struct timeval. &lt;br /&gt;
&lt;br /&gt;
Details: do_div() macro is used incorrectly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.4.9 to v1.6.0=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10316 10316] &lt;br /&gt;
&lt;br /&gt;
Description: Fixed console chatter in case of -ETIMEDOUT. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11684 11684] &lt;br /&gt;
&lt;br /&gt;
Description: Added D_NETTRACE for recording network packet history (initially only for ptllnd). Also a separate userspace ptllnd facility to gather history which should really be covered by D_NETTRACE too, if only CDEBUG recorded history in userspace. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11094 11094] &lt;br /&gt;
&lt;br /&gt;
Description: Multiple instances for o2iblnd &lt;br /&gt;
&lt;br /&gt;
Details: Allow multiple instances of o2iblnd to enable networking over multiple HCAs and routing between them. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12458 12458] &lt;br /&gt;
&lt;br /&gt;
Description: Assertion failure in kernel ptllnd caused by posting passive bulk buffers before connection establishment complete. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12455 12455] &lt;br /&gt;
&lt;br /&gt;
Description: A race in kernel ptllnd between deleting a peer and posting new communications for it could hang communications - manifesting as &amp;quot;Unexpectedly long timeout&amp;quot; messages. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12432 12432] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel ptllnd lock ordering issue could hang a node. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12016 12016] &lt;br /&gt;
&lt;br /&gt;
Description: node crash on socket teardown race &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: after Ptllnd timeouts and portals congestion&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11659 11659] &lt;br /&gt;
&lt;br /&gt;
Description: Credit overflows &lt;br /&gt;
&lt;br /&gt;
Details: This was a bug in ptllnd connection establishment. The fix implements better peer stamps to disambiguate connection establishment and ensure both peers enter the credit flow state machine consistently. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare &lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11394 11394] &lt;br /&gt;
&lt;br /&gt;
Description: kptllnd didn&#039;t propagate some network errors up to LNET &lt;br /&gt;
&lt;br /&gt;
Details: This bug was spotted while investigating 11394. The fix ensures network errors on sends and bulk transfers are propagated to LNET/lustre correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare &lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11616 11616] &lt;br /&gt;
&lt;br /&gt;
Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED. &lt;br /&gt;
&lt;br /&gt;
Details: If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED callback can occur before a connection has actually been established. This caused an assertion failure previously. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11201 11201] &lt;br /&gt;
&lt;br /&gt;
Description: lnet deadlock in router_checker &lt;br /&gt;
&lt;br /&gt;
Details: turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock into BH locks to eliminate potential deadlock caused by ksocknal_data_ready() preempting code holding these locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11126 11126] &lt;br /&gt;
&lt;br /&gt;
Description: Millions of failed socklnd connection attempts cause a very slow FS &lt;br /&gt;
&lt;br /&gt;
Details: added a new route flag ksnr_scheduled to distinguish from ksnr_connecting, so that a peer connection request is only turned down for race concerns when an active connection to the same peer is under progress (instead of just being scheduled). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: gmlnd ignored some transmit errors when finalizing lnet messages. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11472 11472] &lt;br /&gt;
&lt;br /&gt;
Description: Changed the default kqswlnd ntxmsg=512 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: Ptllnd didn&#039;t init kptllnd_data.kptl_idle_txs before it could be possibly accessed in kptllnd_shutdown. Ptllnd should init kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: the_lnet.ln_finalizing was not set when the current thread is about to complete messages. It only affects multi-threaded user space LNet. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: &#039;lctl peer_list&#039; issued on a mx net&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12237 12237] &lt;br /&gt;
&lt;br /&gt;
Description: Enable lctl&#039;s peer_list for MXLND &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://wiki.lustre.org/index.php?title=Change_Log_1.4 Change logs for 1.4.x releases]=&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Change_Log_1.6&amp;diff=4216</id>
		<title>Change Log 1.6</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Change_Log_1.6&amp;diff=4216"/>
		<updated>2008-01-28T03:58:17Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Changes from v1.6.4.1 to v1.6.4.2=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to problems with nested symlinks and FMODE_EXEC (bug 12652), we do not recommend using patchless RHEL4 clients with kernels prior to 2.6.9-55EL (RHEL4U5).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.4-cfs1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;RHEL 4 (patched) and RHEL 5/SLES 10 (patchless) clients behave differently on &#039;cd&#039; to a removed cwd &amp;quot;./&amp;quot; (refer to Bugzilla 14399).&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: only for relatively new filesystems, when OSTs are in recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14631  14631 ]&lt;br /&gt;
&lt;br /&gt;
Description: OST objects below id 20000 are deleted, causing data loss &lt;br /&gt;
&lt;br /&gt;
Details: For relatively newly formatted OST filesystem(s), where there have not been at least 20000 objects created on an OST a bug in MDS-&amp;gt;OST orphan recovery could cause those objects to be deleted if the OST was in recovery, but the MDS was not. Safety checks in the orphan recovery prevent this if more than 20000 objects were ever created on an OST. If the MDS was also in recovery the problem was not hit. Only in 1.6.4.1. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare, depends on device drivers and load&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14529 14529] &lt;br /&gt;
&lt;br /&gt;
Description: MDS or OSS nodes crash due to stack overflow &lt;br /&gt;
&lt;br /&gt;
Details: Code changes in 1.6.4 increased the stack usage of some functions. In some cases, in conjunction with device drivers that use a lot of stack the MDS (or possibly OSS) service threads could overflow the stack. One change which was identified to consume additional stack has been reworked to avoid the extra stack usage. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.4 to v1.6.4.1=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14433 14433] &lt;br /&gt;
&lt;br /&gt;
Description: Oops on connection from 1.6.3 client &lt;br /&gt;
&lt;br /&gt;
Frequency: always, on connection from 1.6.3 client &lt;br /&gt;
&lt;br /&gt;
Details: Enable and accept the OBD_CONNECT_LRU_RESIZE flag only if LRU resizing is enabled at configure time. This fixes an oops caused by incorrectly accepting the LRU_RESIZE feature even if --enable-lru-resize is not specified. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.3 to v1.6.4=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.9.EL (RHEL 4), 2.6.16.53-0.8 (SLES 10), 2.6.18-8.1.14.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11686 11686] &lt;br /&gt;
&lt;br /&gt;
Description: Console message flood &lt;br /&gt;
&lt;br /&gt;
Details: Make cdls ratelimiting more tunable by adding several tunables in procfs: /proc/sys/lnet/console_{min,max}_delay_centisecs and /proc/sys/lnet/console_backoff. &lt;br /&gt;
&lt;br /&gt;
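For example, the new tunables can be read and adjusted directly through procfs (the value 600 is arbitrary): &lt;br /&gt;
&lt;br /&gt;
cat /proc/sys/lnet/console_backoff &lt;br /&gt;
echo 600 &amp;gt; /proc/sys/lnet/console_max_delay_centisecs &lt;br /&gt;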
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13521 13521] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patches for SLES10 2.6.16.53-0.8. &lt;br /&gt;
&lt;br /&gt;
Details: Update which_patch &amp;amp; target file for SLES10 latest kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13128 13128] &lt;br /&gt;
&lt;br /&gt;
Description: add --type and --size parameters to lfs find &lt;br /&gt;
&lt;br /&gt;
Details: Enhance lfs find by adding filetype and filesize parameters. Also multiple OBDs can now be specified for the --obd option. &lt;br /&gt;
&lt;br /&gt;
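An illustrative invocation (the path, OBD name and exact argument syntax are assumptions; see lfs help find for the supported forms): &lt;br /&gt;
&lt;br /&gt;
lfs find --type f --size +100M --obd testfs-OST0000_UUID /mnt/testfs &lt;br /&gt;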
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11270 11270] &lt;br /&gt;
&lt;br /&gt;
Description: eliminate client locks in face of contention &lt;br /&gt;
&lt;br /&gt;
Details: file contention detection and lockless i/o implementation for contended files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12411 12411] &lt;br /&gt;
&lt;br /&gt;
Description: Remove client patches from SLES 10 kernel. &lt;br /&gt;
&lt;br /&gt;
Details: This causes SLES 10 clients to behave as patchless clients even on a Lustre-patched (server) kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=2369 2369 ]&lt;br /&gt;
&lt;br /&gt;
Description: use i_size_read and i_size_write in 2.6 port &lt;br /&gt;
&lt;br /&gt;
Details: replace inode-&amp;gt;i_size access with i_size_read/write() &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13454 13454] &lt;br /&gt;
&lt;br /&gt;
Description: Add jbd statistics patch for RHEL5 and 2.6.18-vanilla. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13518 13518] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL4 2.6.9-55.0.6. &lt;br /&gt;
&lt;br /&gt;
Details: Modify vm-tunables-rhel4.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13452 13452] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel config for 2.6.18-vanilla. &lt;br /&gt;
&lt;br /&gt;
Details: Modify targets/2.6-vanilla.target.in. Add config file kernel-2.6.18-2.6-vanilla-i686.config. Add config file kernel-2.6.18-2.6-vanilla-i686-smp.config. Add config file kernel-2.6.18-2.6-vanilla-x86_64.config. Add config file kernel-2.6.18-2.6-vanilla-x86_64-smp.config. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13207 13207] &lt;br /&gt;
&lt;br /&gt;
Description: adapt the lustre_config script to support the upgrade case &lt;br /&gt;
&lt;br /&gt;
Details: Add &amp;quot;-u&amp;quot; option for lustre_config script to support upgrading 1.4 server targets to 1.6 in parallel. &lt;br /&gt;
&lt;br /&gt;
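A hypothetical example (the CSV file name is made up; other arguments follow the script&#039;s normal usage): &lt;br /&gt;
&lt;br /&gt;
lustre_config -u mycluster_upgrade.csv &lt;br /&gt;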
&lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13751 13751] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL5 2.6.18-8.1.14.el5. &lt;br /&gt;
&lt;br /&gt;
Details: Modify target file &amp;amp; which_patch. A flaw was found in the IA32 system call emulation provided on AMD64 and Intel 64 platforms. An improperly validated 64-bit value could be stored in the %RAX register, which could trigger an out-of-bounds system call table access. An untrusted local user could exploit this flaw to run code in the kernel (ie a root privilege escalation). (CVE-2007-4573). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13748 13748] &lt;br /&gt;
&lt;br /&gt;
Description: Update RHEL 4 kernel to fix local root privilege escalation. &lt;br /&gt;
&lt;br /&gt;
Details: Update to the latest RHEL 4 kernel to fix the vulnerability described in CVE-2007-4573. This problem could allow untrusted local users to gain root access. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14353 14353] &lt;br /&gt;
&lt;br /&gt;
Description: excessive CPU consumption on client reduces IO performance &lt;br /&gt;
&lt;br /&gt;
Details: In some cases the ldlm_poold thread spends too much time trying to cancel locks, and cancels them too aggressively, which can severely impact I/O performance. The dynamic LRU resize code is now disabled at build time; it can be re-enabled with configure --enable-lru-resize at build time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13917 13917] &lt;br /&gt;
&lt;br /&gt;
Description: MDS hang or stay in waiting lock &lt;br /&gt;
&lt;br /&gt;
Details: If the client receives a lock with the CBPENDING flag set, LDLM needs to send the lock cancel as a separate RPC, to avoid the situation where the cancel request cannot be processed because all I/O threads are stuck waiting for locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11710 11710] &lt;br /&gt;
&lt;br /&gt;
Description: improve handling of recoverable errors &lt;br /&gt;
&lt;br /&gt;
Details: If a request fails with an error that is recoverable on the server, the request should be resent; otherwise the page is released from the cache and marked as an error. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12302 12302] &lt;br /&gt;
&lt;br /&gt;
Description: new userspace socklnd &lt;br /&gt;
&lt;br /&gt;
Details: The old userspace tcpnal that resided in lnet/ulnds/socklnd has been replaced with a new one, usocklnd. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: occasional&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13730 13730] &lt;br /&gt;
&lt;br /&gt;
Description: Do not fail import if osc_interpret_create gets -EAGAIN &lt;br /&gt;
&lt;br /&gt;
Details: If osc_interpret_create gets -EAGAIN, it now exits immediately and wakes up oscc_waitq. After the wakeup, oscc_wait_for_objects calls oscc_has_objects, sees that the OSC has no objects, and calls oscc_internal_create to resend the create request. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when removing large files&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13181 13181] &lt;br /&gt;
&lt;br /&gt;
Description: scheduling issue during removal of large Lustre files &lt;br /&gt;
&lt;br /&gt;
Details: Don&#039;t take the BKL in fsfilt_ext3_setattr() for 2.6 kernels. It causes scheduling issues when removing large files (17TB in the present case). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13358 13358] &lt;br /&gt;
&lt;br /&gt;
Description: 1.4.11 Can&#039;t handle directories with stripe set and extended ACLs &lt;br /&gt;
&lt;br /&gt;
Details: It was impossible (EPROTO was returned) to access a directory that has non-default striping and ACLs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only on ppc&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12234 12234] &lt;br /&gt;
&lt;br /&gt;
Description: /proc/fs/lustre/devices broken on ppc &lt;br /&gt;
&lt;br /&gt;
Details: The patch as applied to 1.6.2 doesn&#039;t look correct for all arches. We should make sure the type of &#039;index&#039; is loff_t and then cast explicitly as needed below. Do not assign an explicitly cast loff_t to an int. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for rhel5&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13616 13616] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches update for RHEL5 2.6.18-8.1.10.el5. &lt;br /&gt;
&lt;br /&gt;
Details: Modify the target file &amp;amp; which_kernel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: if the uninit_groups feature is enabled on ldiskfs&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13706 13706] &lt;br /&gt;
&lt;br /&gt;
Description: e2fsck reports &amp;quot;invalid unused inodes count&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: If a new ldiskfs filesystem is created with the &amp;quot;uninit_groups&amp;quot; feature and only a single inode is created in a group then the &amp;quot;bg_unused_inodes&amp;quot; count is incorrectly updated. Creating a second inode in that group would update it correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only if filesystem is inconsistent&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11673 11673] &lt;br /&gt;
&lt;br /&gt;
Description: handle &amp;quot;serious error: objid * already exists&amp;quot; more gracefully &lt;br /&gt;
&lt;br /&gt;
Details: If LAST_ID value on disk is smaller than the objects existing in the O/0/d* directories, it indicates disk corruption and causes an LBUG(). If the object is 0-length, then we should use the existing object. This will help to avoid a full fsck in most cases. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rarely&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13570 13570] &lt;br /&gt;
&lt;br /&gt;
Description: Avoid granted space exceeding available space when the disk is almost full. Without this patch you might see the error &amp;quot;grant XXXX &amp;gt; available&amp;quot; or an LBUG about grants when the disk is almost full. &lt;br /&gt;
&lt;br /&gt;
Details: In filter_check_grant, for a non-granted cache write, the remaining space should be checked with (*left &amp;gt; ungranted + bytes) instead of (*left &amp;gt; ungranted), because the ungranted space should only be increased once we are sure the remaining space is enough for another &amp;quot;bytes&amp;quot;. On the client, cl_avail_grant should only be updated when OBD_MD_FLGRANT is present in the reply. &lt;br /&gt;
&lt;br /&gt;
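A sketch of the corrected condition, using the names from the description above (left, ungranted, bytes); this hypothetical helper is not the actual filter_check_grant() code: &lt;br /&gt;
&lt;br /&gt;
 /* Hypothetical helper mirroring the fixed check: only account the write&lt;br /&gt;
  * against ungranted space if the remaining space covers another &amp;quot;bytes&amp;quot;. */&lt;br /&gt;
 static int check_ungranted_write(unsigned long long *left,&lt;br /&gt;
                                  unsigned long long *ungranted,&lt;br /&gt;
                                  unsigned long long bytes)&lt;br /&gt;
 {&lt;br /&gt;
         /* old (buggy) form was:  if (*left &amp;gt; *ungranted) */&lt;br /&gt;
         if (*left &amp;gt; *ungranted + bytes) {&lt;br /&gt;
                 *ungranted += bytes;&lt;br /&gt;
                 return 1;       /* safe to treat this write as ungranted */&lt;br /&gt;
         }&lt;br /&gt;
         return 0;               /* not enough space left */&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;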
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when using O_DIRECT and quotas&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13930 13930] &lt;br /&gt;
&lt;br /&gt;
Description: Incorrect file ownership on O_DIRECT output files &lt;br /&gt;
&lt;br /&gt;
Details: block usage reported by &#039;lfs quota&#039; does not take into account files that have been written with O_DIRECT. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13976 13976] &lt;br /&gt;
&lt;br /&gt;
Description: touch file failed when fs is not full &lt;br /&gt;
&lt;br /&gt;
Details: An OST in recovery should not be discarded by the MDS in alloc_qos(); otherwise we can get ENOSPC while the filesystem is not full. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13805 13805] &lt;br /&gt;
&lt;br /&gt;
Description: data checksumming impacts single node performance &lt;br /&gt;
&lt;br /&gt;
Details: Disable checksums by default since they impact single node performance. It is still possible to enable checksums by default via &amp;quot;configure --enable-checksum&amp;quot;, or at runtime via procfs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: when lov objid is destroyed&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=14222 14222] &lt;br /&gt;
&lt;br /&gt;
Description: mds can&#039;t recreate lov objid file. &lt;br /&gt;
&lt;br /&gt;
Details: If the lov objid file is destroyed and the OST with the highest index connects first, the MDS does not get the last objid number from that OST. Also, if the MDS does get the last id from the OST, it does not tell the OSC about it, which produces a warning about a wrong delete-orphan request. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rarely&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12948 12948] &lt;br /&gt;
&lt;br /&gt;
Description: buffer overruns could theoretically occur &lt;br /&gt;
&lt;br /&gt;
Details: llapi_semantic_traverse() modifies the &amp;quot;path&amp;quot; argument by appending values to the end of the original string, so a buffer overrun may occur. Add a buffer overrun check in liblustreapi. &lt;br /&gt;
&lt;br /&gt;
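A sketch of the kind of bound check this adds; the helper name and layout are illustrative, not the actual liblustreapi code: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;string.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;errno.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Hypothetical helper: append &amp;quot;/name&amp;quot; to path only if it fits in bufsize. */&lt;br /&gt;
 static int append_path_component(char *path, size_t bufsize, const char *name)&lt;br /&gt;
 {&lt;br /&gt;
         size_t used = strlen(path);&lt;br /&gt;
 &lt;br /&gt;
         /* +1 for the &#039;/&#039; separator, +1 for the trailing NUL */&lt;br /&gt;
         if (used + 1 + strlen(name) + 1 &amp;gt; bufsize)&lt;br /&gt;
                 return -ENAMETOOLONG;   /* would overrun the caller&#039;s buffer */&lt;br /&gt;
 &lt;br /&gt;
         path[used] = &#039;/&#039;;&lt;br /&gt;
         strcpy(path + used + 1, name);&lt;br /&gt;
         return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;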
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13732 13732] &lt;br /&gt;
&lt;br /&gt;
Description: change order of libsysio includes &lt;br /&gt;
&lt;br /&gt;
Details: &#039;#include &amp;lt;sysio.h&amp;gt;&#039; should always come before &#039;#include &amp;lt;xtio.h&amp;gt;&#039; &lt;br /&gt;
&lt;br /&gt;
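For reference, the required ordering looks like this (header names from the entry above; the angle-bracket form is assumed): &lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;sysio.h&amp;gt;   /* must come first */&lt;br /&gt;
 #include &amp;lt;xtio.h&amp;gt;    /* must be included after sysio.h */&lt;br /&gt;
&lt;br /&gt;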
&lt;br /&gt;
=Changes from v1.6.2 to v1.6.3=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - any kernel supported by Lustre, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.2.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18-8.1.8.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.40.2-cfs1&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12192 12192] &lt;br /&gt;
&lt;br /&gt;
Description: llapi_file_create() does not allow some changes &lt;br /&gt;
&lt;br /&gt;
Details: add llapi_file_open() that allows specifying the file creation mode and open flags, and also returns an open file handle. &lt;br /&gt;
&lt;br /&gt;
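A rough usage sketch; the exact prototype (argument order and types) and the header name are assumptions based on later liblustreapi releases, so check the lustreapi header shipped with the release you build against: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;fcntl.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;liblustreapi.h&amp;gt;      /* header name varies by release */&lt;br /&gt;
 &lt;br /&gt;
 /* Assumed prototype: llapi_file_open(name, open_flags, mode, stripe_size,&lt;br /&gt;
  *                                    stripe_offset, stripe_count, stripe_pattern) */&lt;br /&gt;
 static int make_striped_file(const char *path)&lt;br /&gt;
 {&lt;br /&gt;
         int fd = llapi_file_open(path, O_CREAT | O_WRONLY, 0644,&lt;br /&gt;
                                  1048576 /* 1M stripe size */,&lt;br /&gt;
                                  -1      /* start on any OST */,&lt;br /&gt;
                                  2       /* stripe over two OSTs */,&lt;br /&gt;
                                  0       /* default (RAID0) pattern */);&lt;br /&gt;
 &lt;br /&gt;
         return fd;      /* open file descriptor, or a negative value on error */&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;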
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12743 12743] &lt;br /&gt;
&lt;br /&gt;
Description: df doesn&#039;t work properly if diskfs blocksize != 4K &lt;br /&gt;
&lt;br /&gt;
Details: Choose the biggest blocksize among the OSTs as the LOV&#039;s blocksize. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11248 11248] &lt;br /&gt;
&lt;br /&gt;
Description: merge and cleanup kernel patches. &lt;br /&gt;
&lt;br /&gt;
Details: Remove mnt_lustre_list in vfs_intent-2.6-rhel4.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13039 13039] &lt;br /&gt;
&lt;br /&gt;
Description: RedHat Update kernel for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Kernel config file for RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12446 12446] &lt;br /&gt;
&lt;br /&gt;
Description: OSS needs multiple precreate threads &lt;br /&gt;
&lt;br /&gt;
Details: Add ability to start more than one create thread per OSS. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13039 13039] &lt;br /&gt;
&lt;br /&gt;
Description: RedHat Update kernel for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Modify the kernel config file to more closely match RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13360 13360] &lt;br /&gt;
&lt;br /&gt;
Description: Build failure against Centos5 (RHEL5) &lt;br /&gt;
&lt;br /&gt;
Details: Define PAGE_SIZE when it isn&#039;t present. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11401 11401] &lt;br /&gt;
&lt;br /&gt;
Description: client-side metadata stat-ahead during readdir (directory readahead) &lt;br /&gt;
&lt;br /&gt;
Details: perform client-side metadata stat-ahead when the client detects readdir and sequential stat of dir entries therein &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11230 11230] &lt;br /&gt;
&lt;br /&gt;
Description: Tune the kernel for good SCSI performance. &lt;br /&gt;
&lt;br /&gt;
Details: Set the value of /sys/block/{dev}/queue/max_sectors_kb to the value of /sys/block/{dev}/queue/max_hw_sectors_kb in mount_lustre. &lt;br /&gt;
&lt;br /&gt;
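A rough userspace sketch of the tuning step described above (the sysfs paths come from the entry; the code is illustrative, not the actual mount_lustre implementation): &lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* Copy /sys/block/&amp;lt;dev&amp;gt;/queue/max_hw_sectors_kb into max_sectors_kb. */&lt;br /&gt;
 static int tune_max_sectors(const char *dev)&lt;br /&gt;
 {&lt;br /&gt;
         char src[128], dst[128];&lt;br /&gt;
         unsigned long kb;&lt;br /&gt;
         FILE *f;&lt;br /&gt;
 &lt;br /&gt;
         snprintf(src, sizeof(src), &amp;quot;/sys/block/%s/queue/max_hw_sectors_kb&amp;quot;, dev);&lt;br /&gt;
         snprintf(dst, sizeof(dst), &amp;quot;/sys/block/%s/queue/max_sectors_kb&amp;quot;, dev);&lt;br /&gt;
 &lt;br /&gt;
         f = fopen(src, &amp;quot;r&amp;quot;);&lt;br /&gt;
         if (f == NULL)&lt;br /&gt;
                 return -1;&lt;br /&gt;
         if (fscanf(f, &amp;quot;%lu&amp;quot;, &amp;amp;kb) != 1) {&lt;br /&gt;
                 fclose(f);&lt;br /&gt;
                 return -1;&lt;br /&gt;
         }&lt;br /&gt;
         fclose(f);&lt;br /&gt;
 &lt;br /&gt;
         f = fopen(dst, &amp;quot;w&amp;quot;);&lt;br /&gt;
         if (f == NULL)&lt;br /&gt;
                 return -1;&lt;br /&gt;
         fprintf(f, &amp;quot;%lu\n&amp;quot;, kb);&lt;br /&gt;
         fclose(f);&lt;br /&gt;
         return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;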
&lt;br /&gt;
*Severity: critical&lt;br /&gt;
&lt;br /&gt;
Frequency: Always for filesystems larger than 2TB on 32-bit systems.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13547 13547] , [https://bugzilla.lustre.org/show_bug.cgi?id=13627 13627] &lt;br /&gt;
&lt;br /&gt;
Description: Data corruption for OSTs that are formatted larger than 2TB on 32-bit servers. &lt;br /&gt;
&lt;br /&gt;
Details: When generating the bio request for Lustre file writes, the sector number would overflow a temporary variable before being used for the I/O. The data reads back correctly from Lustre (which overflows in the same manner), but other file data or filesystem metadata may be corrupted in some cases. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13236 13236] &lt;br /&gt;
&lt;br /&gt;
Description: TOE Kernel panic by ksocklnd &lt;br /&gt;
&lt;br /&gt;
Details: Offloaded sockets provide their own implementation of sendpage, so ksocklnd cannot call tcp_sendpage() directly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13482 13482] &lt;br /&gt;
&lt;br /&gt;
Description: build error &lt;br /&gt;
&lt;br /&gt;
Details: fix typos in gmlnd, ptllnd and viblnd &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12932 12932] &lt;br /&gt;
&lt;br /&gt;
Description: obd_health_check_timeout too short &lt;br /&gt;
&lt;br /&gt;
Details: Set obd_health_check_timeout to 1.5x obd_timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only with quota on the root user&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12223 12223] &lt;br /&gt;
&lt;br /&gt;
Description: mds_obd_create error creating tmp object &lt;br /&gt;
&lt;br /&gt;
Details: When the user sets a quota on root, llog operations are affected and cannot create or write files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12782 12782] &lt;br /&gt;
&lt;br /&gt;
Description: /proc/sys/lnet has non-sysctl entries &lt;br /&gt;
&lt;br /&gt;
Details: Updating dump_kernel/daemon_file/debug_mb to use sysctl variables &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10778 10778] &lt;br /&gt;
&lt;br /&gt;
Description: kibnal_shutdown() doesn&#039;t finish; lconf --cleanup hangs &lt;br /&gt;
&lt;br /&gt;
Details: races between lnd_shutdown and peer creation prevent lnd_shutdown from finishing. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13279 13279] &lt;br /&gt;
&lt;br /&gt;
Description: open files rlimit 1024 reached while liblustre testing &lt;br /&gt;
&lt;br /&gt;
Details: ulnds/socklnd must close the open socket after an unsuccessful &#039;say hello&#039; attempt. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always on directories with default striping set&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12836 12836] &lt;br /&gt;
&lt;br /&gt;
Description: lfs find on -1 stripe looping in lsm_lmm_verify_common() &lt;br /&gt;
&lt;br /&gt;
Details: Avoid lov_verify_lmm_common() on directory with -1 stripe count. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Always on ia64 patchless client, and possibly others.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12826 12826] &lt;br /&gt;
&lt;br /&gt;
Description: Add EXPORT_SYMBOL check for node_to_cpumask symbol. &lt;br /&gt;
&lt;br /&gt;
Details: This allows the patchless client to be loaded on architectures without this export. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13142 13142] &lt;br /&gt;
&lt;br /&gt;
Description: disorder of journal start and llog_add cause deadlock. &lt;br /&gt;
&lt;br /&gt;
Details: In llog_origin_connect, the journal start should happen before llog_add to keep the same ordering as other functions and avoid the deadlock. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: occasionally when using NFS&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13030 13030] &lt;br /&gt;
&lt;br /&gt;
Description: &amp;quot;ll_intent_file_open()) lock enqueue: err: -13&amp;quot; with nfs &lt;br /&gt;
&lt;br /&gt;
Details: With NFS, the anonymous dentry&#039;s parent is set to itself in d_alloc_anon(), so on the MDS we use rec-&amp;gt;ur_fid1 to find the corresponding dentry rather than rec-&amp;gt;ur_name. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Occasionally with failover&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12459 12459] &lt;br /&gt;
&lt;br /&gt;
Description: Client eviction due to failover config &lt;br /&gt;
&lt;br /&gt;
Details: after a connection loss, the lustre client should attempt to reconnect to the last active server first before trying the other potential connections. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with liblustre clients on XT3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12418 12418] &lt;br /&gt;
&lt;br /&gt;
Description: evictions taking too long &lt;br /&gt;
&lt;br /&gt;
Details: allow llrd to evict clients directly on OSTs &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13125 13125] &lt;br /&gt;
&lt;br /&gt;
Description: osts not allocated evenly to files &lt;br /&gt;
&lt;br /&gt;
Details: change the condition to increase offset_idx &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13436 13436] &lt;br /&gt;
&lt;br /&gt;
Description: Only disconnect errors should be returned by rq_status. &lt;br /&gt;
&lt;br /&gt;
Details: In the open/enqueue process, some errors, which will cause the client to be disconnected, should be returned by rq_status, while other errors should still be returned by the intent, so that mdc or llite will detect them. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13600 13600] &lt;br /&gt;
&lt;br /&gt;
Description: &amp;quot;lfs find -obd UUID&amp;quot; prints directories &lt;br /&gt;
&lt;br /&gt;
Details: &amp;quot;lfs find -obd UUID&amp;quot; will return all directory names instead of just file names. It is incorrect because the directories do not reside on the OSTs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13596 13596] &lt;br /&gt;
&lt;br /&gt;
Description: MDS hang after unclean shutdown of lots of clients &lt;br /&gt;
&lt;br /&gt;
Details: Never resend AST requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Always, for kernels after 2.6.16&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13304 13304] &lt;br /&gt;
&lt;br /&gt;
Description: Fix warning &amp;quot;idr_remove called for id=.. which is not allocated&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
Details: Recent kernels save the old s_dev before killing the superblock and do not allow restoring it from a callback; restore it before calling kill_anon_super. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12186 12186] &lt;br /&gt;
&lt;br /&gt;
Description: Fix errors in lfs documentation &lt;br /&gt;
&lt;br /&gt;
Details: Fixes man pages &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12588 12588] &lt;br /&gt;
&lt;br /&gt;
Description: When the MDS and OSTs use different quota units (32-bit and 64-bit), quota will be released repeatedly. &lt;br /&gt;
&lt;br /&gt;
Details: Avoid sending multiple quota requests to the MDS, which would keep the status between the requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: cleanup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13532 13532] &lt;br /&gt;
&lt;br /&gt;
Description: rewrite ext2-derived code in llite/dir.c and obdclass/uuid.c &lt;br /&gt;
&lt;br /&gt;
Details: rewrite inherited code (uuid parsing code from ext2 utils and readdir code from ext3) from scratch preserving functionality. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.1 to v1.6.2=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.286 (SLES 9), 2.6.9-55.0.2.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18-8.1.8.el5 (RHEL 5), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs8&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12786 12786] &lt;br /&gt;
&lt;br /&gt;
Description: lfs setstripe enhancement &lt;br /&gt;
&lt;br /&gt;
Details: Make lfs setstripe understand &#039;k&#039;, &#039;m&#039; and &#039;g&#039; for stripe size. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12211 12211] &lt;br /&gt;
&lt;br /&gt;
Description: random memory allocation failure utility &lt;br /&gt;
&lt;br /&gt;
Details: Make Lustre randomly fail memory allocations for testing purposes. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10786 10786] &lt;br /&gt;
&lt;br /&gt;
Description: omit setting fsid for NFS export &lt;br /&gt;
&lt;br /&gt;
Details: Fix setting/restoring the device id to avoid EMFILE errors, and mark the Lustre filesystem as FS_REQUIRES_DEV to avoid problems with generating the fsid. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10595 10595] &lt;br /&gt;
&lt;br /&gt;
Description: Error message improvement. &lt;br /&gt;
&lt;br /&gt;
Details: Merging of two LCONSOLE_ERROR_MSG into one. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12602 12606] &lt;br /&gt;
&lt;br /&gt;
Description: don&#039;t use GFP_* in generic Lustre code. &lt;br /&gt;
&lt;br /&gt;
Details: Use cfs_alloc_* functions and CFS_* flags for code portability. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12333 12333] &lt;br /&gt;
&lt;br /&gt;
Description: obdclass is limited by single OBD_ALLOC(idarray) &lt;br /&gt;
&lt;br /&gt;
Details: replace OBD_ALLOC/OBD_FREE with OBD_VMALLOC/OBD_VFREE &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415] &lt;br /&gt;
&lt;br /&gt;
Description: updated patches for new RHEL4 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Fixed ext3-unlink-race.patch per Kalpak&#039;s comment. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13006 13006] &lt;br /&gt;
&lt;br /&gt;
Description: warnings when building the patchless client with vanilla 2.6.19 and up &lt;br /&gt;
&lt;br /&gt;
Details: change old ctl_table style and replace ctl_table/ctl_table_header with cfs_sysctl_table_t/cfs_sysctl_table_header_t &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13093 13093] &lt;br /&gt;
&lt;br /&gt;
Description: O_DIRECT bypasses client statistics. &lt;br /&gt;
&lt;br /&gt;
Details: When running with O_DIRECT I/O, neither the client rpc_stats nor read_ahead_stats were updated. Copied stats section from osc_send_oap_rpc() into async_internal(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13249 13249] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel patches for SLES9 2.6.5-7.286 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Update target/ChangeLog/which_patch . &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12955 12955] &lt;br /&gt;
&lt;br /&gt;
Description: jbd statistics &lt;br /&gt;
&lt;br /&gt;
Details: Port older jbd statistics patch for sles10 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13360 13360] &lt;br /&gt;
&lt;br /&gt;
Description: Build failure against Centos5 (RHEL5) &lt;br /&gt;
&lt;br /&gt;
Details: Use getpagesize() instead of PAGE_SIZE. &lt;br /&gt;
&lt;br /&gt;
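A minimal userspace sketch in the direction named above, falling back to getpagesize() where the PAGE_SIZE macro is not visible to userspace builds; the guard shown is illustrative rather than the exact patch: &lt;br /&gt;
&lt;br /&gt;
 #include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 /* On some distributions (e.g. CentOS 5 / RHEL 5) PAGE_SIZE is not exported&lt;br /&gt;
  * to userspace headers, so use the runtime page size instead. */&lt;br /&gt;
 #ifndef PAGE_SIZE&lt;br /&gt;
 # define PAGE_SIZE ((size_t)getpagesize())&lt;br /&gt;
 #endif&lt;br /&gt;
&lt;br /&gt;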
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: after network failures&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12769 12769] &lt;br /&gt;
&lt;br /&gt;
Description: Add sync option to mount_lustre.c &lt;br /&gt;
&lt;br /&gt;
Details: Client loses data written to lustre after a network interruption. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: mds/oss recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10800 10800] &lt;br /&gt;
&lt;br /&gt;
Description: llog ctxt is referenced after it has been freed. &lt;br /&gt;
&lt;br /&gt;
Details: An llog ctxt refcount was added to avoid the race between ctxt free and the llog recovery process. Each llog user must hold a ctxt refcount before it accesses the llog, and the llog ctxt can only be freed when its refcount is zero. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for SLES10&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12771 12771] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patch for SLES10 SP1 &lt;br /&gt;
&lt;br /&gt;
Details: Add patch blkdev_tunables-2.6-sles10.patch to 2.6-sles10.series. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11802 11802] &lt;br /&gt;
&lt;br /&gt;
Description: lustre support for RHEL5 &lt;br /&gt;
&lt;br /&gt;
Details: Add support for RHEL5. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11756 11756] &lt;br /&gt;
&lt;br /&gt;
Description: umount blocks forever on error &lt;br /&gt;
&lt;br /&gt;
Details: As a result of incorrect use of the obd_no_recov and obd_force flags, the client can hang if a cancel or some other request is lost. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Only for SLES&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13177 13177] &lt;br /&gt;
&lt;br /&gt;
Description: sanity_quota fail test_1 &lt;br /&gt;
&lt;br /&gt;
Details: There are multiple occurrences of $TSTUSR in SLES&#039;s /etc/group file, which makes TSTID[2] non-unique. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9977 9977] &lt;br /&gt;
&lt;br /&gt;
Description: lvbo_init failed for resource with missing objects. &lt;br /&gt;
&lt;br /&gt;
Details: Fix returning an error when we stat a file with missing/corrupted objects; i_size is set to the sum of the sizes of all available objects. If we truncate or write to a missing object, it is recreated. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: When flocks are used.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13103 13103] &lt;br /&gt;
&lt;br /&gt;
Description: assertion failure in ldlm_cli_enqueue_fini for a non-NULL lock. &lt;br /&gt;
&lt;br /&gt;
Details: Flock locks might destroy a just-granted lock if it can be merged with another existing flock; this is done in the completion handler, so teach ldlm_cli_enqueue_fini that this is a valid case for flock locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: Rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11974 11974] &lt;br /&gt;
&lt;br /&gt;
Description: reply_lock_interpret crash due to race with it and lock cancel. &lt;br /&gt;
&lt;br /&gt;
Details: Do not replay locks that are being cancelled. Do not reference locks by their address during replay, just by their handle. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with deactivated OSTs&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11679 11679] &lt;br /&gt;
&lt;br /&gt;
Description: lstripe command fails for valid OST index &lt;br /&gt;
&lt;br /&gt;
Details: The stripe offset is compared to &#039;lov-&amp;gt;desc.ld_tgt_count&#039; instead of lov-&amp;gt;desc.ld_active_tgt_count. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13147 13147] &lt;br /&gt;
&lt;br /&gt;
Description: block reactivating mgc import until all deactivates complete &lt;br /&gt;
&lt;br /&gt;
Details: Fix race when failing back MDT/MGS to itself (testing) &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for Cray XT3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11706 11706] &lt;br /&gt;
&lt;br /&gt;
Description: peer credits not enough on many OST per OSS systems. &lt;br /&gt;
&lt;br /&gt;
Details: Use new lnet way to add credits as we need those for pings and ASTs &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with liblustre&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12790 12790] &lt;br /&gt;
&lt;br /&gt;
Description: Liblustre is not releasing flock locks on file close. &lt;br /&gt;
&lt;br /&gt;
Details: Release flock locks on file close. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for RHEL4&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12839 12839] &lt;br /&gt;
&lt;br /&gt;
Description: Update kernel patches for kernel-2.6.9-55.0.2.EL &lt;br /&gt;
&lt;br /&gt;
Details: Remove inode-nr_unused-2.6.9-rhel4.patch from 2.6-rhel4.series. Update the target file and kernel config. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11327 11327] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(export != NULL) failed in target_handle_connect &lt;br /&gt;
&lt;br /&gt;
Details: The assertion hit is the result of a rare race between a disconnect and a connect to the same NID. target_handle_connect found an old connection cookie and tried to reconnect, but could not find the export for this cookie. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13276 13276] &lt;br /&gt;
&lt;br /&gt;
Description: Oops in read and write path when failing to allocate lock. &lt;br /&gt;
&lt;br /&gt;
Details: Check if lock allocation failed and return error back. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0.1 to v1.6.1=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 and 1.2 viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.6.5-7.283 (SLES 9), 2.6.9-55.EL (RHEL 4), 2.6.16.46-0.14 (SLES 10), 2.6.18.8 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see http://wiki.lustre.org/index.php?title=Patchless_Client) 2.6.16 - 2.6.22 vanilla (kernel.org)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Due to recently discovered recovery problems, we do not recommend using patchless RHEL 4 clients with this or any earlier release.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs8&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel. &lt;br /&gt;
Starting with this release, the ldiskfs backing filesystem required by Lustre is now in its own package, lustre-ldiskfs. This package should be installed. It is versioned separately from Lustre and may be released separately in future.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12194 12194] &lt;br /&gt;
&lt;br /&gt;
Description: add optional extra BUILD_VERSION info &lt;br /&gt;
&lt;br /&gt;
Details: Add a new environment variable (namely LUSTRE_VERS) which allows overriding the Lustre version. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11548 11548] &lt;br /&gt;
&lt;br /&gt;
Description: Add LNET router traceability for debug purposes &lt;br /&gt;
&lt;br /&gt;
Details: If a checksum failure occurs with a router as part of the IO path, the NID of the last router that forwarded the bulk data is printed so it can be identified. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10997 10997] &lt;br /&gt;
&lt;br /&gt;
Description: lfs setstripe uses optional parameters instead of positional parameters. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10651 10651] &lt;br /&gt;
&lt;br /&gt;
Description: Nanosecond timestamp support for ldiskfs &lt;br /&gt;
&lt;br /&gt;
Details: The on-disk ldiskfs filesystem has added support for nanosecond resolution timestamps. There is not yet support for this at the Lustre filesystem level. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10768 10768] &lt;br /&gt;
&lt;br /&gt;
Description: 64-bit inode version &lt;br /&gt;
&lt;br /&gt;
Details: Add an on-disk 64-bit inode version for ext3 to track changes made to the inode. This will be required for version-based recovery. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11563 11563] &lt;br /&gt;
&lt;br /&gt;
Description: Add -o localflock option to simulate old noflock behaviour. &lt;br /&gt;
&lt;br /&gt;
Details: This achieves local-only flock/fcntl lock coherency. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11647 11647] &lt;br /&gt;
&lt;br /&gt;
Description: update patchless client &lt;br /&gt;
&lt;br /&gt;
Details: Add support for patchless client with 2.6.20, 2.6.21 and RHEL 5 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10589 10589] &lt;br /&gt;
&lt;br /&gt;
Description: metadata RPC reduction (e.g. for rm performance) &lt;br /&gt;
&lt;br /&gt;
Details: Decrease the number of synchronous RPCs between clients and servers by cancelling conflicting locks before the operation on the client and packing their handles into the main operation RPC to the server. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12605 12605] &lt;br /&gt;
&lt;br /&gt;
Description: add #ifdef HAVE_KERNEL_CONFIG_H &lt;br /&gt;
&lt;br /&gt;
Details: Kernels from 2.6.19 on do not need to include linux/config.h; instead, linux/autoconf.h is included via the compiler command line. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12764 12764] &lt;br /&gt;
&lt;br /&gt;
Description: patchless client support for 2.6.22 kernel &lt;br /&gt;
&lt;br /&gt;
Details: 2.6.22 has only one visible change: the SLAB_CTOR_* constants were removed. In this case we need to drop the OS-dependent interface to kmem_cache and use the cfs_mem_cache API. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10968 10968] &lt;br /&gt;
&lt;br /&gt;
Description: VFS operations stats tool. &lt;br /&gt;
&lt;br /&gt;
Details: A tool that collects stats by tracking values written for pid, ppid and gid, and uses llstat to generate output so a graph can be plotted with plot-llstat. Updated lustre/utils/Makefile.am. Added lustre/utils/ltrack_stats.c. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11039 11039] &lt;br /&gt;
&lt;br /&gt;
Description: 2.6.18 server support (lustre 1.6.1) &lt;br /&gt;
&lt;br /&gt;
Details: Support for 2.6.18 kernels on the server side. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12678 12678] &lt;br /&gt;
&lt;br /&gt;
Description: remove fs_prep_san_write operation and related patches &lt;br /&gt;
&lt;br /&gt;
Details: remove the ext3-san-jdike patches which are no longer useful. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4900 4900] &lt;br /&gt;
&lt;br /&gt;
Description: Async OSC create to avoid the blocking unnecessarily. &lt;br /&gt;
&lt;br /&gt;
Details: If an OST has no remaining objects, the system will block on object creation when it needs to create a new object on this OST. Now, always use pre-created objects when available instead of blocking on an empty OSC while others are not empty. If we must block, we block for the shortest possible period of time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11721 11721] &lt;br /&gt;
&lt;br /&gt;
Description: Add printing inode info into message about error in writepage. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11971 11971] &lt;br /&gt;
&lt;br /&gt;
Description: Accessing a block device can re-enable I/O when Lustre is tearing down a device. &lt;br /&gt;
&lt;br /&gt;
Details: dev_clear_rdonly(bdev) must be called in kill_bdev() instead of blkdev_put(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only with mballoc3 code and deep extent trees&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12861 12861] &lt;br /&gt;
&lt;br /&gt;
Description: ldiskfs_ext_search_right: bad header in inode: unexpected eh_depth &lt;br /&gt;
&lt;br /&gt;
Details: a wrong check of extent headers in ldiskfs_ext_search_right() can cause the filesystem to be remounted read-only. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13129 13129] &lt;br /&gt;
&lt;br /&gt;
Description: server LBUG when shutting down &lt;br /&gt;
&lt;br /&gt;
Details: Block umount forever until the mount refcount is zero rather than giving up after an arbitrary timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: 2.6.18 servers only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12546 12546] &lt;br /&gt;
&lt;br /&gt;
Description: ll_kern_mount() doesn&#039;t release the module reference &lt;br /&gt;
&lt;br /&gt;
Details: The ldiskfs module reference count never drops down to 0 because ll_kern_mount() doesn&#039;t release the module reference. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12470 12470] &lt;br /&gt;
&lt;br /&gt;
Description: server LBUG when using old ost_num_threads parameter &lt;br /&gt;
&lt;br /&gt;
Details: Accept the old ost_num_threads parameter but warn that it is deprecated, and fix an off-by-one error that caused an LBUG. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11722 11722] &lt;br /&gt;
&lt;br /&gt;
Description: Transient SCSI error results in persistent IO issue &lt;br /&gt;
&lt;br /&gt;
Details: iobuf-&amp;gt;dr_error is not reinitialized to 0 between two uses. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: sometimes when underlying device returns I/O errors&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11743 11743] &lt;br /&gt;
&lt;br /&gt;
Description: OSTs not going read-only during write failures &lt;br /&gt;
&lt;br /&gt;
Details: OSTs are not remounted read-only when the journal commit threads get I/O errors because fsfilt_ext3 calls journal_start/stop() instead of the ext3 wrappers. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: SLES10 only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12538 12538] &lt;br /&gt;
&lt;br /&gt;
Description: sanity-quota.sh quotacheck failed: rc = -22 &lt;br /&gt;
&lt;br /&gt;
Details: Quotas cannot be enabled on SLES10. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: liblustre clients only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12229 12229] &lt;br /&gt;
&lt;br /&gt;
Description: getdirentries does not give error when run on compute nodes &lt;br /&gt;
&lt;br /&gt;
Details: getdirentries does not fail when the size specified as an argument is too small to contain at least one entry &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11315 11315] &lt;br /&gt;
&lt;br /&gt;
Description: OST &amp;quot;spontaneously&amp;quot; evicts client; client has imp_pingable == 0 &lt;br /&gt;
&lt;br /&gt;
Details: Due to a race condition, liblustre clients were occasionally evicted incorrectly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: during server recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11203 11203] &lt;br /&gt;
&lt;br /&gt;
Description: MDS failing to send precreate requests due to OSCC_FLAG_RECOVERING &lt;br /&gt;
&lt;br /&gt;
Details: Requests with the rq_no_resend flag did not wake up l_wait_event if they got a timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11818 11818] &lt;br /&gt;
&lt;br /&gt;
Description: MDS fails to start if a duplicate client export is detected &lt;br /&gt;
&lt;br /&gt;
Details: in some rare cases it was possible for a client to connect to an MDS multiple times. Upon recovery the MDS would detect this and fail during startup. Handle this more gracefully. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12477 12477] &lt;br /&gt;
&lt;br /&gt;
Description: Wrong request locking in request set processing &lt;br /&gt;
&lt;br /&gt;
Details: ptlrpc_check_set wrongly uses req-&amp;gt;rq_lock to protect the addition to imp_delayed_list; imp_lock should be used here instead. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when reconnecting&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Grant leak when OSC reconnect to OST &lt;br /&gt;
&lt;br /&gt;
Details: When the OSC reconnects to the OST, the OST (filter) should check whether it should grant more space to the client by comparing fed_grant and cl_avail_grant, and return the already granted space to the client instead of &amp;quot;newly granted&amp;quot; space, because the client will call osc_init_grant to update the client grant space info. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when client reconnects to OST&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Grant leak when OSC does a resend and replays bulk write &lt;br /&gt;
&lt;br /&gt;
Details: When the OSC reconnects to the OST, the OST (filter) should clear the grant info of bulk write requests, because the grant info will be synchronized between the OSC and OST on reconnect, and we should ignore the grant info of resent/replayed write requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11662 11662] &lt;br /&gt;
&lt;br /&gt;
Description: Granted space sometimes exceeds available space. &lt;br /&gt;
&lt;br /&gt;
Details: When the OST is about to be full, two bulk writes from different clients may arrive at the OST. According to the available space on the OST, the first request should be permitted and the second one should be denied with ENOSPC. But if the second arrives before the first one is committed, the OST might wrongly permit the second write, which will cause granted space &amp;gt; available space. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when client is evicted&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12371 12371] &lt;br /&gt;
&lt;br /&gt;
Description: Grant might be wrongly erased when osc is evicted by OST &lt;br /&gt;
&lt;br /&gt;
Details: When the import is evicted by the server, it will fork another thread, ptlrpc_invalidate_import_thread, to invalidate the import and set the grant to 0, while the original thread will update the grant it got when connecting. If the former happens later, the grant will be wrongly erased because of this race. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12401 12401] &lt;br /&gt;
&lt;br /&gt;
Description: Checking Stale with correct fid &lt;br /&gt;
&lt;br /&gt;
Details: ll_revalidate_it should use de_inode instead of op_data.fid2 to check whether it is stale, because sometimes we want the enqueue to happen anyway, and op_data.fid2 will not be initialized. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only with 2.4 kernel&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12134 12134] &lt;br /&gt;
&lt;br /&gt;
Description: random memory corruption &lt;br /&gt;
&lt;br /&gt;
Details: The size of struct ll_inode_info is too big for union inode.u, and this can be the cause of random memory corruption. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10818 10818] &lt;br /&gt;
&lt;br /&gt;
Description: Memory leak in recovery &lt;br /&gt;
&lt;br /&gt;
Details: lov_mds_md was not freed in an error handler in mds_create_object. It should also check obd_fail before fsfilt_start; otherwise, if fsfilt_start returns -EROFS (failover of the MDS during MDS recovery), the request will return with repmsg-&amp;gt;transno = 0 and rc = EROFS, and we hit the assertion LASSERT(req-&amp;gt;rq_reqmsg-&amp;gt;transno == req-&amp;gt;rq_repmsg-&amp;gt;transno) in ptlrpc_replay_interpret. Fcc should be freed no matter whether fsfilt_commit succeeds or not. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11935 11935] &lt;br /&gt;
&lt;br /&gt;
Description: Not check open intent error before release open handle &lt;br /&gt;
&lt;br /&gt;
Details: In some rare cases, the open intent error is not checked before releasing the open handle, which may trigger ASSERTION(open_req-&amp;gt;rq_transno != 0) because it tries to release the failed open handle. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12556 12556] &lt;br /&gt;
&lt;br /&gt;
Description: Set cat log bitmap only after create log success. &lt;br /&gt;
&lt;br /&gt;
Details: In some rare cases, the cat log bitmap is set too early; it should be set only after the log is created successfully. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12086 12086] &lt;br /&gt;
&lt;br /&gt;
Description: the cat log was not initialized in recovery &lt;br /&gt;
&lt;br /&gt;
Details: When the MDS (MGS) does recovery, the tgt_count might be zero, so the unlink log on the MDS will not be initialized until MDS post-recovery. Also, in MDS post-recovery the unlink log initialization is done asynchronously, so there can be a race between adding to the unlink log and the unlink log initialization. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12579 12597] &lt;br /&gt;
&lt;br /&gt;
Description: brw_stats were being printed incorrectly &lt;br /&gt;
&lt;br /&gt;
Details: brw_stats were being printed as log2 values, but not all of them were recorded as log2. Also remove some code duplication arising from filter_tally_{read,write}. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare, only in recovery.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11674 11674] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(req-&amp;gt;rq_type != LI_POISON) failed &lt;br /&gt;
&lt;br /&gt;
Details: imp_lock should be held while iterating over imp_sending_list, to prevent the request from being destroyed after a timeout in ptlrpc_queue_wait. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12689 12689] &lt;br /&gt;
&lt;br /&gt;
Description: replay-single.sh test 52 fails &lt;br /&gt;
&lt;br /&gt;
Details: A lock&#039;s skiplist needs to be cleaned up when it is being unlinked from its resource list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11737 11737] &lt;br /&gt;
&lt;br /&gt;
Description: Short direct I/O read returns the full requested size rather than the actual amount read. &lt;br /&gt;
&lt;br /&gt;
Details: Direct I/O operations should return actual amount of bytes transferred rather than requested size. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12646 12646] &lt;br /&gt;
&lt;br /&gt;
Description: sanity.sh test_77h fails with &amp;quot;test_77h file compare failed&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: test_77h uses a file that had been modified by another test case. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12576 12576] &lt;br /&gt;
&lt;br /&gt;
Description: Missing check for NULL lov_tgts in some lov functions &lt;br /&gt;
&lt;br /&gt;
Details: Checking whether lov_tgts is NULL in some functions. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11815 11815] &lt;br /&gt;
&lt;br /&gt;
Description: replace obdo_alloc() with OBDO_ALLOC macro &lt;br /&gt;
&lt;br /&gt;
Details: nothing special is done in obdo_alloc() function, and for debugging purpose, it needs to be replaced with macros. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12784 12784] &lt;br /&gt;
&lt;br /&gt;
Description: bad return value and errno from fcntl call &lt;br /&gt;
&lt;br /&gt;
Details: In the liblustre API, errno should be a negative value when an error occurs. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11544 11544] &lt;br /&gt;
&lt;br /&gt;
Description: ptlrpc_check_set() LBUG &lt;br /&gt;
&lt;br /&gt;
Details: A positive reply from the server combined with a failed client bulk callback after the bulk transfer should not trigger an LBUG; instead, the request should be processed as erroneous. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12696 12696] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION(imp-&amp;gt;imp_conn_current) failed &lt;br /&gt;
&lt;br /&gt;
Details: An assertion failure is hit if a client node boots and attempts to mount a Lustre filesystem in less than RECONNECT_INTERVAL seconds. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for i686&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12695 12695] &lt;br /&gt;
&lt;br /&gt;
Description: 1.4.11 RC1 build fails for RHEL 4, i686 &lt;br /&gt;
&lt;br /&gt;
Details: Fixed config variable for build. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415] &lt;br /&gt;
&lt;br /&gt;
Description: Updated patches for the new RHEL4 kernel &lt;br /&gt;
&lt;br /&gt;
Details: Updated the patches inode-nr_unused-2.6.9-rhel4.patch, jbd-stats-2.6.9.patch, qsnet-rhel4-2.6.patch, quota-deadlock-on-pagelock-core.patch, vfs_intent-2.6-rhel4.patch and vfs_races-2.6-rhel4.patch; updated the series files 2.6-rhel4-titech.series and 2.6-rhel4.series; updated the kernel config files. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12374 12374] &lt;br /&gt;
&lt;br /&gt;
Description: lquota slave hits an LBUG when reconnecting with the MDS or &lt;br /&gt;
during MDS failover. &lt;br /&gt;
&lt;br /&gt;
Details: The quota slave depends on qctxt-&amp;gt;lqc_import to send its quota requests. This pointer becomes invalid if the MDS fails over or breaks its connection to the OSTs, which leads to an LBUG. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: when qunit size is too small (less than 20M)&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12588  12588] &lt;br /&gt;
&lt;br /&gt;
Description: write is stopped by improper -EDQUOT &lt;br /&gt;
&lt;br /&gt;
Details: If the master is busy and the qunit size is small enough (say, 1M), the slave cannot get quota from the master in time, which leads the slave to return an improper -EDQUOT to the client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12629 12629] &lt;br /&gt;
&lt;br /&gt;
Description: Deadlock during metadata tests &lt;br /&gt;
&lt;br /&gt;
Details: in prune_dir_dentries(), shrink_dcache_parent() should not be called with the per-dentry lock held. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: SLES9 only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12744 12744] &lt;br /&gt;
&lt;br /&gt;
Description: Lustre patched kernel for SLES9 SP3 has NR_CPUS set to 8 &lt;br /&gt;
&lt;br /&gt;
Details: set CONFIG_NR_CPUS to 128 instead of 8. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11324 11324] &lt;br /&gt;
&lt;br /&gt;
Description: LDISKFS-fs error (device sdc): ldiskfs_free_blocks &lt;br /&gt;
&lt;br /&gt;
Details: a disk corruption can cause the mballoc code to assert on a double free or other extent corruptions. Handle these with ext3_error() instead of assertions. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13063 13063] &lt;br /&gt;
&lt;br /&gt;
Description: lfsck built against 1.4.x cannot run against 1.6.0 lustre &lt;br /&gt;
&lt;br /&gt;
Details: the definition for OBD_IOC_GETNAME changed in 1.6.0. One of the few external users of this ioctl number is lfsck&#039;s call to llapi_lov_get_uuids() and this caused lfsck to fail at startup. Add the old ioctl number to the handler so both old and new lfsck can work. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11301 11301] &lt;br /&gt;
&lt;br /&gt;
Description: parallel lock callbacks &lt;br /&gt;
&lt;br /&gt;
Details: Instead of sending blocking and completion callbacks as separate requests, add them to a set and send them in parallel. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12417 12417] &lt;br /&gt;
&lt;br /&gt;
Description: Disable most debugging by default &lt;br /&gt;
&lt;br /&gt;
Details: To improve performance, disable most logging (for debug purposes) by default. VFSTRACE, RPCTRACE, and DLMTRACE are now off by default, and HA includes fewer messages. &lt;br /&gt;
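&lt;br /&gt;
For sites that still want the verbose logs, the old level of debugging can be re-enabled at run time. A minimal sketch, assuming the usual /proc/sys/lnet/debug mask interface of this release (the value shown is only illustrative and enables all debug flags): &lt;br /&gt;
&lt;br /&gt;
 sysctl -w lnet.debug=-1 &lt;br /&gt;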
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11013 11013] &lt;br /&gt;
&lt;br /&gt;
Description: hash tables for lists of nids, connections and uuids &lt;br /&gt;
&lt;br /&gt;
Details: Hash tables noticeably help when many clients connect to a server: duplicate connections and reconnects are identified faster, and the export to evict can be found faster in the manual eviction case. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11190 11190] &lt;br /&gt;
&lt;br /&gt;
Description: Sometimes, when the server evicts a client, the client is not evicted as soon as possible. &lt;br /&gt;
&lt;br /&gt;
Details: In the enqueue request, the error was returned via the intent instead of rq_status, which made the ptlrpc layer not detect the error and not evict the client. The enqueue error should therefore be returned via rq_status. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for SLES9&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12543 12543] &lt;br /&gt;
&lt;br /&gt;
Description: Routinely utilize latest Quadrics drivers in CFS releases &lt;br /&gt;
&lt;br /&gt;
Details: Update patch qsnet-suse-2.6.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only for sles10&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12771 12771] &lt;br /&gt;
&lt;br /&gt;
Description: Update patches for SLES 10 SP1 kernel. &lt;br /&gt;
&lt;br /&gt;
Details: Update the patch vfs_intent-2.6-sles10.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12543 12543] &lt;br /&gt;
&lt;br /&gt;
Description: Routinely utilize latest Quadrics drivers in CFS releases &lt;br /&gt;
&lt;br /&gt;
Details: Update patch qsnet-rhel4-2.6.patch. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12975 12975] &lt;br /&gt;
&lt;br /&gt;
Description: Using wrong pointer in osc_brw_prep_request &lt;br /&gt;
&lt;br /&gt;
Details: Access to array[-1] can produce a panic if the kernel is compiled with CONFIG_PAGE_ALLOC enabled &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: only in recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13148 13148] &lt;br /&gt;
&lt;br /&gt;
Description: Mark an OST as accessible early if its startup SYNC is still in progress. &lt;br /&gt;
&lt;br /&gt;
Details: osc_precreate() returns the early-accessible flag if the oscc is marked OSCC_FLAG_SYNC_IN_PROGRESS. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13196 13196] &lt;br /&gt;
&lt;br /&gt;
Description: Sometimes the precreate code can trigger object creation on the wrong OST &lt;br /&gt;
&lt;br /&gt;
Details: Incorrectly protected, or not restored, variables after the precreate loop can produce object creation on the wrong OST. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: oss recovery&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10800 10800] &lt;br /&gt;
&lt;br /&gt;
Description: llog_commit_thread cleanup should sync with llog_commit_thread &lt;br /&gt;
start &lt;br /&gt;
&lt;br /&gt;
Details: llog_commit_thread_count should be kept in sync between llog_commit start and cleanup, so that a new llog_commit thread is not started while llog_commit threads are being stopped, to avoid accessing freed data. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with 10000 clients or more&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12364 12364] &lt;br /&gt;
&lt;br /&gt;
Description: poor connect scaling with increasing client count &lt;br /&gt;
&lt;br /&gt;
Details: Don&#039;t run filter_grant_sanity_check for more than 100 exports to improve scaling for large numbers of clients. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: nfs export on patchless client&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11970 11970] &lt;br /&gt;
&lt;br /&gt;
Description: connectathon hangs when testing NFS export over a patchless client &lt;br /&gt;
&lt;br /&gt;
Details: A disconnected dentry cannot be found via lookup, so we do not need to unhash it or make it invalid &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11546 11546] &lt;br /&gt;
&lt;br /&gt;
Description: open req refcounting wrong on reconnect &lt;br /&gt;
&lt;br /&gt;
Details: If reconnect happened between getting open reply from server and call to mdc_set_replay_data in ll_file_open, we will schedule replay for unreferenced request that we are about to free. Subsequent close will crash in variety of ways. Check that request is still eligible for replay in mdc_set_replay_data(). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11512 11512] &lt;br /&gt;
&lt;br /&gt;
Description: disable writes to filesystem when reading health_check file &lt;br /&gt;
&lt;br /&gt;
Details: the default for reading the health_check proc file has changed to NOT do a journal transaction and write to disk, because this can cause reads of the /proc file to hang and block HA state checking on a healthy but otherwise heavily loaded system. It is possible to return to the previous behaviour during configure with --enable-health-write. &lt;br /&gt;
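&lt;br /&gt;
As a sketch, the health file is still polled the same way, and the previous write-to-disk behaviour can be restored at build time with the configure switch mentioned above (the proc path is assumed to be the usual one): &lt;br /&gt;
&lt;br /&gt;
 cat /proc/fs/lustre/health_check &lt;br /&gt;
 ./configure --enable-health-write &lt;br /&gt;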
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11658 11658] &lt;br /&gt;
&lt;br /&gt;
Description: log_commit_thread vs filter_destroy race leads to crash &lt;br /&gt;
&lt;br /&gt;
Details: Take import reference before releasing llog record semaphore &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only with huge numbers of clients&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11817 11817] &lt;br /&gt;
&lt;br /&gt;
Description: Prevent taking the superblock lock in llap_from_page for a soon-to-die page. &lt;br /&gt;
&lt;br /&gt;
Details: Using the LL_ORIGIN_REMOVEPAGE origin flag instead of LL_ORIGIN_UNKNOW for the llap_from_page call in ll_removepage() prevents taking the superblock lock for a soon-to-die page. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11706 11706] &lt;br /&gt;
&lt;br /&gt;
Description: service threads may hog cpus when there are a lot of requests &lt;br /&gt;
&lt;br /&gt;
Details: Insert cond_resched to give other threads a chance to use some CPU &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12747 12747] &lt;br /&gt;
&lt;br /&gt;
Description: fix mal-formatted messages &lt;br /&gt;
&lt;br /&gt;
Details: fix some mal-formatted DEBUG_REQ and LCONSOLE_ERROR_MSG messages &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always in liblustre&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11737 11737] &lt;br /&gt;
&lt;br /&gt;
Description: wrong IS_ERR implementation in liblustre.h &lt;br /&gt;
&lt;br /&gt;
Details: Fix the IS_ERR implementation in liblustre.h so errors are detected correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10419 10419] &lt;br /&gt;
&lt;br /&gt;
Description: Correct the condition for outputting a debug message. &lt;br /&gt;
&lt;br /&gt;
Details: An inode i_nlink equal to zero is not enough to output a message about disk corruption; i_ctime and i_mode should also be checked. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always in patchless client&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12415 12415] &lt;br /&gt;
&lt;br /&gt;
Description: add configure check for truncate_complete_page &lt;br /&gt;
&lt;br /&gt;
Details: Improve checks for exported symbols. This allows running the check without kernel &lt;br /&gt;
sources, using only the Module.symvers shipped with the kernel distribution. Add a check for truncate_complete_page, used by the patchless client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only run on patchless client.&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12858 12858] &lt;br /&gt;
&lt;br /&gt;
Description: use do_facet in sanity.sh for tests handling recoverable errors &lt;br /&gt;
&lt;br /&gt;
Details: Use do_facet instead of calling sysctl directly to set fail_loc on the OST &lt;br /&gt;
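&lt;br /&gt;
An illustrative sketch of the change (the facet name and fail_loc value below are placeholders, not taken from the test): &lt;br /&gt;
&lt;br /&gt;
 do_facet ost1 sysctl -w lustre.fail_loc=0x80000000 &lt;br /&gt;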
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only at startup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11778 11778] &lt;br /&gt;
&lt;br /&gt;
Description: Delay client connections to MDT until the first MDT-&amp;gt;OST connect &lt;br /&gt;
&lt;br /&gt;
Details: If a client tried to create a new file before the MDT had connected to any OSTs, the create would return EIO. Now the client will simply block until the MDT connects to the first OST and the create can succeed. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: at startup only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12860 12860] &lt;br /&gt;
&lt;br /&gt;
Description: mds_lov_synchronize race leads to various problems &lt;br /&gt;
&lt;br /&gt;
Details: simultaneous MDT-&amp;gt;OST connections at startup can cause the sync to abort, leaving the OSC in a bad state. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0 to v1.6.0.1=&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: on some architectures&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12404 12404] &lt;br /&gt;
&lt;br /&gt;
Description: 1.6 client sometimes fails to mount from a 1.4 MDT &lt;br /&gt;
&lt;br /&gt;
Details: Uninitialized flags sometimes cause configuration commands to be skipped. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: patchless clients only&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12391 12391] &lt;br /&gt;
&lt;br /&gt;
Description: missing __iget() symbol export &lt;br /&gt;
&lt;br /&gt;
Details: The __iget() symbol export is missing. To avoid the need for it on patchless clients, the deathrow inode reaper is turned off and we depend on the VM to clean up old inodes. This dependency was introduced via the fix for bug 12181. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12848 12848] &lt;br /&gt;
&lt;br /&gt;
Description: sanity.sh fail: test_52b &lt;br /&gt;
&lt;br /&gt;
Details: ll_inode_to_ext_flags() had a glitch which made the MDS return incorrect inode flags to the client. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.4.10 to v1.6.0=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CONFIGURATION CHANGE. This version of Lustre WILL NOT INTEROPERATE with older versions automatically. In many cases a special upgrade step is needed. Please read the user documentation before upgrading any part of a 1.4.x system.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;WARNING: Lustre configuration and startup changes are required with this release. See https://mail.clusterfs.com/wikis/lustre/MountConf for details.&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for kernels: 2.4.21-47.0.1.EL (RHEL 3), 2.6.5-7.283 (SLES 9), 2.6.9-42.0.10.EL (RHEL 4), 2.6.12.6 vanilla (kernel.org), 2.6.16.27-0.9 (SLES10)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Client support for unpatched kernels: (see https://mail.clusterfs.com/wikis/lustre/PatchlessClient) 2.6.16 - 2.6.19 vanilla (kernel.org), 2.6.9-42.0.8EL (RHEL 4)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Recommended e2fsprogs version: 1.39.cfs6&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note that reiserfs quotas are disabled on SLES 10 in this kernel&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4900 4900] &lt;br /&gt;
&lt;br /&gt;
Description: Async OSC create to avoid the blocking unnecessarily. &lt;br /&gt;
&lt;br /&gt;
Details: If an OST has no remaining precreated objects, the system would block on creation when a new object was needed on that OST. Now, always use precreated objects when available instead of blocking on an empty OSC while others are not empty. If we must block, we block for the shortest possible period of time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=8007 8007] &lt;br /&gt;
&lt;br /&gt;
Description: MountConf &lt;br /&gt;
&lt;br /&gt;
Details: Lustre configuration is now managed via mkfs and mount commands instead of lmc and lconf. New obd types (MGS, MGC) are added for dynamic configuration management. See https://mail.clusterfs.com/wikis/lustre/MountConf for details. &lt;br /&gt;
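&lt;br /&gt;
A minimal sketch of the new style, with placeholder filesystem, node and device names: &lt;br /&gt;
&lt;br /&gt;
 mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda &lt;br /&gt;
 mkfs.lustre --fsname=testfs --ost --mgsnode=mgsnode@tcp0 /dev/sdb &lt;br /&gt;
 mount -t lustre /dev/sda /mnt/mdt &lt;br /&gt;
 mount -t lustre mgsnode@tcp0:/testfs /mnt/client &lt;br /&gt;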
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4482 4482] &lt;br /&gt;
&lt;br /&gt;
Description: dynamic OST addition &lt;br /&gt;
&lt;br /&gt;
Details: OSTs can now be added to a live filesystem &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9851 9851] &lt;br /&gt;
&lt;br /&gt;
Description: startup order invariance &lt;br /&gt;
&lt;br /&gt;
Details: MDTs and OSTs can be started in any order. Clients only require the MDT to complete startup. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4899 4899] &lt;br /&gt;
&lt;br /&gt;
Description: parallel, asynchronous orphan cleanup &lt;br /&gt;
&lt;br /&gt;
Details: orphan cleanup is now performed in separate threads for each OST, allowing parallel non-blocking operation. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9862 9862] &lt;br /&gt;
&lt;br /&gt;
Description: optimized stripe assignment &lt;br /&gt;
&lt;br /&gt;
Details: stripe assignments are now made based on ost space available, ost previous usage, and OSS previous usage, in order to try to optimize storage space and networking resources. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=4226 4226] &lt;br /&gt;
&lt;br /&gt;
Description: Permanently set tunables &lt;br /&gt;
&lt;br /&gt;
Details: All writable /proc/fs/lustre tunables can now be permanently set on a per-server basis, at mkfs time or on a live system. &lt;br /&gt;
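&lt;br /&gt;
A sketch of the two interfaces (obdtype, parameter, value and the device are placeholders, not real tunable names): &lt;br /&gt;
&lt;br /&gt;
 mkfs.lustre --ost --param=obdtype.parameter=value /dev/sdb &lt;br /&gt;
 lctl conf_param testfs-OST0000.obdtype.parameter=value &lt;br /&gt;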
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10547 10547] &lt;br /&gt;
&lt;br /&gt;
Description: Lustre message v2 &lt;br /&gt;
&lt;br /&gt;
Details: Add lustre message format v2. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9866 9866] &lt;br /&gt;
&lt;br /&gt;
Description: client OST exclusion list &lt;br /&gt;
&lt;br /&gt;
Details: Clients can be started with a list of OSTs that should be declared &amp;quot;inactive&amp;quot; for known non-responsive OSTs. &lt;br /&gt;
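&lt;br /&gt;
A minimal sketch, assuming the exclude= client mount option introduced for this feature (all names are placeholders): &lt;br /&gt;
&lt;br /&gt;
 mount -t lustre -o exclude=testfs-OST0000 mgsnode@tcp0:/testfs /mnt/client &lt;br /&gt;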
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10088 10088] &lt;br /&gt;
&lt;br /&gt;
Description: fine-grained SMP locking inside DLM &lt;br /&gt;
&lt;br /&gt;
Details: Improve DLM performance on SMP systems by removing the single per-namespace lock and replacing it with per-resource locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9332 9332] &lt;br /&gt;
&lt;br /&gt;
Description: don&#039;t hold multiple extent locks at one time &lt;br /&gt;
&lt;br /&gt;
Details: To avoid client eviction during large writes, locks are not held on multiple stripes at one time or for very large writes. Otherwise, clients can block waiting for a lock on a failed OST while holding locks on other OSTs and be evicted. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9293 9293] &lt;br /&gt;
&lt;br /&gt;
Description: Multiple MD RPCs in flight. &lt;br /&gt;
&lt;br /&gt;
Details: Further unserialise some read-only MDT RPCs (learn about intents). To avoid overloading the MDT, introduce a limit on the number of MDT RPCs in flight for a single client and add /proc controls to adjust this limit. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22484 22484] &lt;br /&gt;
&lt;br /&gt;
Description: client read/write statistics &lt;br /&gt;
&lt;br /&gt;
Details: Add client read/write call usage stats for performance analysis of user processes. /proc/fs/lustre/llite/*/offset_stats shows non-sequential file access. extents_stats shows chunk size distribution. extents_stats_per_process show chunk size distribution per user process. &lt;br /&gt;
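&lt;br /&gt;
For example, the new files can simply be read on the client (paths as listed above): &lt;br /&gt;
&lt;br /&gt;
 cat /proc/fs/lustre/llite/*/offset_stats &lt;br /&gt;
 cat /proc/fs/lustre/llite/*/extents_stats &lt;br /&gt;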
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22485 22485] &lt;br /&gt;
&lt;br /&gt;
Description: per-client statistics on server &lt;br /&gt;
&lt;br /&gt;
Details: Add ldlm and operations statistics for each client in /proc/fs/lustre/mds|obdfilter/*/exports/ &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=22486 22486] &lt;br /&gt;
&lt;br /&gt;
Description: improved MDT statistics &lt;br /&gt;
&lt;br /&gt;
Details: Add detailed MDT operations statistics in /proc/fs/lustre/mds/*/stats &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10968 10968] &lt;br /&gt;
&lt;br /&gt;
Description: VFS operations stats &lt;br /&gt;
&lt;br /&gt;
Details: Add client VFS call stats, trackable by pid, ppid, or gid: /proc/fs/lustre/llite/*/stats_track_[pid|ppid|gid] &lt;br /&gt;
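&lt;br /&gt;
A hedged usage sketch, assuming the tracking file accepts an id written to it (the pid is a placeholder): &lt;br /&gt;
&lt;br /&gt;
 echo 12345 &amp;gt; /proc/fs/lustre/llite/*/stats_track_pid &lt;br /&gt;
 cat /proc/fs/lustre/llite/*/stats &lt;br /&gt;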
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=2258 2258] &lt;br /&gt;
&lt;br /&gt;
Description: Dynamic service threads &lt;br /&gt;
&lt;br /&gt;
Details: Within a small range, start extra service threads automatically when the request queue builds up. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11229 11229] &lt;br /&gt;
&lt;br /&gt;
Description: Easy OST removal &lt;br /&gt;
&lt;br /&gt;
Details: OSTs can be permanently deactivated with e.g. &#039;lctl conf_param lustre-OST0001.osc.active=0&#039; &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11335 11335] &lt;br /&gt;
&lt;br /&gt;
Description: MGS proc entries &lt;br /&gt;
&lt;br /&gt;
Details: Added basic proc entries for the MGS showing what filesystems are served. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10998 10998] &lt;br /&gt;
&lt;br /&gt;
Description: provide MGS failover &lt;br /&gt;
&lt;br /&gt;
Details: Added config lock reacquisition after MGS server failover. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11461 11461] &lt;br /&gt;
&lt;br /&gt;
Description: add Linux 2.4 support &lt;br /&gt;
&lt;br /&gt;
Details: Added support for RHEL 2.4.21 kernel for 1.6 servers and clients &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10902 10902] &lt;br /&gt;
&lt;br /&gt;
Description: plain/inodebits lock performance improvement &lt;br /&gt;
&lt;br /&gt;
Details: Group plain/inodebits locks in the granted list by their request modes and bits policy, improving the performance of searching the granted list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11667 11667] &lt;br /&gt;
&lt;br /&gt;
Description: Add &amp;quot;/proc/sys/lustre/debug_peer_on_timeout&amp;quot; &lt;br /&gt;
&lt;br /&gt;
Details: liblustre environment variable LIBLUSTRE_DEBUG_PEER_ON_TIMEOUT: a boolean controlling whether to print peer debug info when a client&#039;s RPC times out. &lt;br /&gt;
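&lt;br /&gt;
For example, a liblustre application (the program name below is a placeholder) can be launched with the variable set: &lt;br /&gt;
&lt;br /&gt;
 LIBLUSTRE_DEBUG_PEER_ON_TIMEOUT=1 ./my_liblustre_app &lt;br /&gt;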
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11264 11264] &lt;br /&gt;
&lt;br /&gt;
Description: Add uninit_groups feature to ldiskfs2 to speed up e2fsck &lt;br /&gt;
&lt;br /&gt;
Details: The uninit_groups feature works in conjunction with the kernel filesystem code (ldiskfs2 only) and e2fsprogs-1.39-cfs6 to speed up the pass1 processing of e2fsck. This is a read-only feature in ldiskfs2 only, so older kernels and current ldiskfs cannot mount filesystems that have had this feature enabled. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10816 10816] &lt;br /&gt;
&lt;br /&gt;
Description: Improve multi-block allocation algorithm to avoid fragmentation &lt;br /&gt;
&lt;br /&gt;
Details: The mballoc3 code (ldiskfs2 only) adds new mechanisms to improve allocation locality and avoid filesystem fragmentation. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: mixed-endian client/server environments&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11214 11214] &lt;br /&gt;
&lt;br /&gt;
Description: mixed-endian crashes &lt;br /&gt;
&lt;br /&gt;
Details: The new msg_v2 system had some failures in mixed-endian environments. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: when an incorrect nid is specified during startup&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10734 10734] &lt;br /&gt;
&lt;br /&gt;
Description: ptlrpc connect to a non-existent node causes a kernel crash &lt;br /&gt;
&lt;br /&gt;
Details: LNET can&#039;t be re-entered from an event callback, which happened when a message expired after the export had been cleaned up. Instead, hand the zombie cleanup off to another thread. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only if OST filesystem is corrupted&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=9829 9829] &lt;br /&gt;
&lt;br /&gt;
Description: client incorrectly hits assertion in ptlrpc_replay_req() &lt;br /&gt;
&lt;br /&gt;
Details: for a short time RPCs with bulk IO are in the replay list, but replay of bulk IOs is unimplemented. If the OST filesystem is corrupted due to disk cache incoherency and then replay is started it is possible to trip an assertion. Avoid putting committed RPCs into the replay list at all to avoid this issue. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: liblustre (e.g. catamount) on a large cluster with &amp;gt;= 8 OSTs/OSS&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11684 11684] &lt;br /&gt;
&lt;br /&gt;
Description: System hang on startup &lt;br /&gt;
&lt;br /&gt;
Details: This bug allowed the liblustre (e.g. catamount) client to return to the app before handling all startup RPCs. This could leave the node unresponsive to lustre network traffic and manifested as a server ptllnd timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: only for devices with external journals&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10719 10719] &lt;br /&gt;
&lt;br /&gt;
Description: Set external device read-only also &lt;br /&gt;
&lt;br /&gt;
Details: During a commanded failover stop, we set the disk device read-only while the server shuts down. We now also set any external journal device read-only at the same time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: when setting specific OST indices&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11149 11149] &lt;br /&gt;
&lt;br /&gt;
Description: QOS code breaks on skipped indices &lt;br /&gt;
&lt;br /&gt;
Details: Add checks for missing OST indices in the QOS code, so OSTs created with --index need not be sequential. &lt;br /&gt;
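&lt;br /&gt;
For instance (a sketch with placeholder names and devices), OSTs can now be formatted with non-sequential indices: &lt;br /&gt;
&lt;br /&gt;
 mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=mgsnode@tcp0 /dev/sdb &lt;br /&gt;
 mkfs.lustre --fsname=testfs --ost --index=5 --mgsnode=mgsnode@tcp0 /dev/sdc &lt;br /&gt;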
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12123 12123] &lt;br /&gt;
&lt;br /&gt;
Description: ENOENT returned for valid filehandle during dbench. &lt;br /&gt;
&lt;br /&gt;
Details: Check if a directory has children when invalidating dentries associated with an inode during lock cancellation. This fixes an incorrect ENOENT sometimes seen for valid filehandles during testing with dbench. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11330 11330] &lt;br /&gt;
&lt;br /&gt;
Description: a large application tries to do I/O to the same resource and dies in the middle of it. &lt;br /&gt;
&lt;br /&gt;
Details: Check the req-&amp;gt;rq_arrival time after the call to ost_brw_lock_get(), but before we do anything about processing it &amp;amp; sending the BULK transfer request. This should help move old stale pending locks off the queue as quickly as obd_timeout. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: SFS test only (otherwise harmless)&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=6062 6062] &lt;br /&gt;
&lt;br /&gt;
Description: SPEC SFS validation failure on NFS v2 over lustre. &lt;br /&gt;
&lt;br /&gt;
Details: Changes the blocksize for regular files to be 2x RPC size, and not depend on stripe size. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=6380 6380] &lt;br /&gt;
&lt;br /&gt;
Description: Fix client-side osc byte counters &lt;br /&gt;
&lt;br /&gt;
Details: The osc read/write byte counters in /proc/fs/lustre/osc/*/stats are now working &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always as root on SLES&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10667 10667] &lt;br /&gt;
&lt;br /&gt;
Description: Failure of copying files with lustre special EAs. &lt;br /&gt;
&lt;br /&gt;
Details: The client side now always returns success for the setxattr call for Lustre-special xattrs (currently only &amp;quot;trusted.lov&amp;quot;). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: always&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10345 10345] &lt;br /&gt;
&lt;br /&gt;
Description: Refcount LNET uuids &lt;br /&gt;
&lt;br /&gt;
Details: The global LNET uuid list grew linearly with every startup; refcount repeated list entries instead of always adding to the list. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: only for kernels with patches from Lustre below 1.4.3&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11248 11248] &lt;br /&gt;
&lt;br /&gt;
Description: Remove old rdonly API &lt;br /&gt;
&lt;br /&gt;
Details: Remove the old rdonly API, which has been unused since at least Lustre 1.4.3 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: when upgrading from 1.4 while trying to change parameters&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11692 11692] &lt;br /&gt;
&lt;br /&gt;
Description: The wrong (new) MDC name was used when setting parameters for upgraded MDTs. Also allow changing OSC (and MDC) parameters if --writeconf is specified at tunefs upgrade time. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.6.0 to v1.4.11=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1 viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12014 12014] &lt;br /&gt;
&lt;br /&gt;
Description: ASSERTION failures when upgrading to the patchless zero-copy socklnd &lt;br /&gt;
&lt;br /&gt;
Details: This bug affects &amp;quot;rolling upgrades&amp;quot;, causing an inconsistent protocol version negotiation and subsequent assertion failure during rolling upgrades after the first wave of upgrades. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10916 10916] &lt;br /&gt;
&lt;br /&gt;
Description: added LNET self test &lt;br /&gt;
&lt;br /&gt;
Details: landing b_self_test &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12316 12316] &lt;br /&gt;
&lt;br /&gt;
Description: Add OFED1.2 support to o2iblnd &lt;br /&gt;
&lt;br /&gt;
Details: o2iblnd depends on OFED&#039;s modules; if out-of-tree OFED modules are installed (rather than the kernel&#039;s in-tree InfiniBand), there can be problems while loading o2iblnd (mismatched CRCs of ib_* symbols). If an extra Module.symvers is supported by the kernel (i.e. 2.6.17), this link provides a solution: https://bugs.openfabrics.org/show_bug.cgi?id=355 If an extra Module.symvers is not supported by the kernel, we have to run the script in bug 12316 to update $LINUX/module.symvers before building o2iblnd. More details about this are in bug 12316. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11680 11680] &lt;br /&gt;
&lt;br /&gt;
Description: make panic on lbug configurable &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=13288 13288] &lt;br /&gt;
&lt;br /&gt;
Description: Initialize cpumask before use &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11223 11223] &lt;br /&gt;
&lt;br /&gt;
Details: Change &amp;quot;dropped message&amp;quot; CERRORs to D_NETERROR so they are logged instead of creating &amp;quot;console chatter&amp;quot; when a lustre timeout races with normal RPC completion. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Details: lnet_clear_peer_table can wait forever if user forgets to clear a lazy portal. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Details: libcfs_id2str should check pid against LNET_PID_ANY. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12227 12227] &lt;br /&gt;
&lt;br /&gt;
Description: cfs_duration_{u,n}sec() wrongly calculate nanosecond part of struct timeval. &lt;br /&gt;
&lt;br /&gt;
Details: do_div() macro is used incorrectly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Changes from v1.4.9 to v1.6.0=&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Support for networks: socklnd - kernels up to 2.6.16, qswlnd - Qsnet kernel modules 5.20 and later, openiblnd - IbGold 1.8.2, o2iblnd - OFED 1.1, viblnd - Voltaire ibhost 3.4.5 and later, ciblnd - Topspin 3.2.0, iiblnd - Infiniserv 3.3 + PathBits patch, gmlnd - GM 2.1.22 and later, mxlnd - MX 1.2.1 or later, ptllnd - Portals 3.3 / UNICOS/lc 1.5.x, 2.0.x&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=10316 10316] &lt;br /&gt;
&lt;br /&gt;
Description: Fixed console chatter in case of -ETIMEDOUT. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11684 11684] &lt;br /&gt;
&lt;br /&gt;
Description: Added D_NETTRACE for recording network packet history (initially only for ptllnd). Also a separate userspace ptllnd facility to gather history which should really be covered by D_NETTRACE too, if only CDEBUG recorded history in userspace. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: enhancement&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11094 11094] &lt;br /&gt;
&lt;br /&gt;
Description: Multiple instances for o2iblnd &lt;br /&gt;
&lt;br /&gt;
Details: Allow multiple instances of o2iblnd to enable networking over multiple HCAs and routing between them. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12458 12458] &lt;br /&gt;
&lt;br /&gt;
Description: Assertion failure in kernel ptllnd caused by posting passive bulk buffers before connection establishment complete. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12455 12455] &lt;br /&gt;
&lt;br /&gt;
Description: A race in kernel ptllnd between deleting a peer and posting new communications for it could hang communications - manifesting as &amp;quot;Unexpectedly long timeout&amp;quot; messages. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12432 12432] &lt;br /&gt;
&lt;br /&gt;
Description: Kernel ptllnd lock ordering issue could hang a node. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12016 12016] &lt;br /&gt;
&lt;br /&gt;
Description: node crash on socket teardown race &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: after Ptllnd timeouts and portals congestion&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11659 11659] &lt;br /&gt;
&lt;br /&gt;
Description: Credit overflows &lt;br /&gt;
&lt;br /&gt;
Details: This was a bug in ptllnd connection establishment. The fix implements better peer stamps to disambiguate connection establishment and ensure both peers enter the credit flow state machine consistently. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare &lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11394 11394] &lt;br /&gt;
&lt;br /&gt;
Description: kptllnd didn&#039;t propagate some network errors up to LNET &lt;br /&gt;
&lt;br /&gt;
Details: This bug was spotted while investigating 11394. The fix ensures network errors on sends and bulk transfers are propagated to LNET/lustre correctly. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Frequency: rare &lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11616 11616] &lt;br /&gt;
&lt;br /&gt;
Description: o2iblnd handle early RDMA_CM_EVENT_DISCONNECTED. &lt;br /&gt;
&lt;br /&gt;
Details: If the fabric is lossy, an RDMA_CM_EVENT_DISCONNECTED callback can occur before a connection has actually been established. This caused an assertion failure previously. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11201 11201] &lt;br /&gt;
&lt;br /&gt;
Description: lnet deadlock in router_checker &lt;br /&gt;
&lt;br /&gt;
Details: turned ksnd_connd_lock, ksnd_reaper_lock, and ksock_net_t:ksnd_lock into BH locks to eliminate potential deadlock caused by ksocknal_data_ready() preempting code holding these locks. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: major&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11126 11126] &lt;br /&gt;
&lt;br /&gt;
Description: Millions of failed socklnd connection attempts cause a very slow FS &lt;br /&gt;
&lt;br /&gt;
Details: added a new route flag ksnr_scheduled to distinguish from ksnr_connecting, so that a peer connection request is only turned down for race concerns when an active connection to the same peer is in progress (instead of just being scheduled). &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: gmlnd ignored some transmit errors when finalizing lnet messages. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: normal&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=11472 11472] &lt;br /&gt;
&lt;br /&gt;
Description: Changed the default kqswlnd ntxmsg=512 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: Ptllnd didn&#039;t init kptllnd_data.kptl_idle_txs before it could be possibly accessed in kptllnd_shutdown. Ptllnd should init kptllnd_data.kptl_ptlid2str_lock before calling kptllnd_ptlid2str. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: ptllnd logs a piece of incorrect debug info in kptllnd_peer_handle_hello. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: rare&lt;br /&gt;
&lt;br /&gt;
Description: the_lnet.ln_finalizing was not set when the current thread is about to complete messages. It only affects multi-threaded user space LNet. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
*Severity: minor&lt;br /&gt;
&lt;br /&gt;
Frequency: &#039;lctl peer_list&#039; issued on a mx net&lt;br /&gt;
&lt;br /&gt;
Bugzilla: [https://bugzilla.lustre.org/show_bug.cgi?id=12237 12237] &lt;br /&gt;
&lt;br /&gt;
Description: Enable lctl&#039;s peer_list for MXLND &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://wiki.lustre.org/index.php?title=Change_Log_1.4 change log 1.4]=&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4215</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4215"/>
		<updated>2008-01-28T03:49:53Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation on gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June,2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June,2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** Powerpoint slides of an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL on CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:] This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan,John Shalf (NERSC) on CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H.Sharp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPix site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory event at the computing centre (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May,2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute (HPC) Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
** Glenn Newell, Sr. IT Solutions Mgr,&lt;br /&gt;
** Naji Bekhazi, Director of R&amp;amp;D, Mask Data Prep (CATS)&lt;br /&gt;
** Ray Morgan, Sr. Product Marketing Manager, Mask Data Prep (CATS)&lt;br /&gt;
** 2007&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf paper in pdf format ]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution.(2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have been addressed&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site](It&#039;s the same as the attachment of LCI paper above.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies (May 2006)&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4214</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4214"/>
		<updated>2008-01-28T03:44:05Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Synopsys */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation at a gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre 1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** PowerPoint slides giving an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL at CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:] This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including a novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan, John Shalf (NERSC) at CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory session at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute (HPC) Systems For Mask Data Preparation Software (CATS)&lt;br /&gt;
&lt;br /&gt;
** Glenn Newell, Sr. IT Solutions Mgr&lt;br /&gt;
&lt;br /&gt;
** Naji Bekhazi, Director of R&amp;amp;D, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
** Ray Morgan, Sr. Product Marketing Manager, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
** 2007&lt;br /&gt;
&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** Proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution (2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have since been addressed.&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site] (the same paper as the LCI attachment above)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies (May 2006)&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Main_Page&amp;diff=4213</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Main_Page&amp;diff=4213"/>
		<updated>2008-01-28T03:35:48Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Other Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== What is Lustre®? ==&lt;br /&gt;
&lt;br /&gt;
Lustre® is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Cluster File Systems, Inc.&lt;br /&gt;
&lt;br /&gt;
The central goal is the development of a next-generation cluster file system that can serve clusters with tens of thousands of nodes, provide petabytes of storage, and move hundreds of GB/sec, all with state-of-the-art security and management infrastructure.&lt;br /&gt;
&lt;br /&gt;
Lustre runs on many of the largest Linux clusters in the world and is included by CFS&#039;s partners as a core component of their cluster offerings (examples include HP StorageWorks SFS and the Cray XT3 and XD1 supercomputers). Today&#039;s users have also demonstrated that Lustre scales down as well as it scales up, running in production on clusters as small as 4 nodes and as large as 25,000 nodes.&lt;br /&gt;
&lt;br /&gt;
The latest version of Lustre is always available from Cluster File Systems, Inc. Public Open Source releases of Lustre are available under the GNU General Public License. These releases are found here, and are used in production supercomputing environments worldwide.&lt;br /&gt;
&lt;br /&gt;
To be informed of Lustre releases, subscribe to the [http://wiki.lustre.org/index.php?title=Mailing_Lists lustre-announce] mailing list.&lt;br /&gt;
&lt;br /&gt;
Lustre development would not have been possible without funding and guidance from many organizations, including several U.S. National Laboratories, early adopters, and product partners.&lt;br /&gt;
&lt;br /&gt;
== User Resources == &lt;br /&gt;
&lt;br /&gt;
* [http://www.sun.com/software/products/lustre/get.jsp Lustre Downloads]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Quick_Start Lustre Quick Start]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Mailing_Lists Mailing Lists]&lt;br /&gt;
* [http://manual.lustre.org/index.php?title=Main_Page Lustre Operations Manual]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Bug_Filing Filing Bugs]&lt;br /&gt;
* [https://bugzilla.lustre.org/showdependencytree.cgi?id=2374 Lustre Knowledge Base]&lt;br /&gt;
&lt;br /&gt;
== Advanced User Resources == &lt;br /&gt;
&lt;br /&gt;
*[http://wiki.lustre.org/index.php?title=BuildLustre How to build Lustre]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Kerb_Lustre Kerberos]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=LustreTuning Lustre Tuning]&lt;br /&gt;
* [http://wiki.lustre.org/images/7/78/LustreManual.html#Chapter_III-2._LustreProc LustreProc] - A guide to the proc tunable parameters for Lustre and their usage. It describes several of the proc tunables, including those that affect the client&#039;s RPC behavior, and prepares for a substantial reorganization of proc entries.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=LibLustre_HowTo Liblustre HowTo]&lt;br /&gt;
&lt;br /&gt;
== Lustre Centres of Excellence™ ==&lt;br /&gt;
&lt;br /&gt;
* [http://ornl-lce.clusterfs.com/index.php?title=Main_Page ORNL]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/cea-lce CEA]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/llnl-lce LLNL]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/psc-lce/index.php?title=Main_Page PSC]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/tsinghua-lce Tsinghua]&lt;br /&gt;
&lt;br /&gt;
== Developer Resources ==&lt;br /&gt;
* [http://arch.lustre.org Lustre Architecture]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Contribution_Policy Contribution Policy]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Mailing_List Developer Mailing List]&lt;br /&gt;
* CVS usage&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Open_CVS CVS access to Lustre Source]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Cvs_Branches CVS Branches] - How to manage branches with CVS.&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Cvs_Tips CVS Tips] - Helpful things to know while using Lustre CVS.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Debugging Debugging Lustre] - A guide to debugging Lustre.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=ZFS_Resources ZFS Resources] - Learn about ZFS.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Coding_Guidelines Coding Guidelines] - Developer guidelines to avoid problems during Lustre code merges.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Portals_Ping_Client_Server Portals Ping Client Server] - Kernel modules used to test basic message passing over Portals.&lt;br /&gt;
&lt;br /&gt;
== CFS Development Projects ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=IOPerformanceProject I/O Performance]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU Lustre OSS/MDS with ZFS DMU]&lt;br /&gt;
&lt;br /&gt;
== Community Development Projects ==&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Networking_Development Networking Development]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Diskless_Booting Diskless Booting]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Drbd_And_Lustre DRBD and Lustre]&lt;br /&gt;
* [http://www.bullopensource.org/lustre Bull- Open Source tools for Lustre]&lt;br /&gt;
* [http://www.sourceforge.net/projects/lmt LLNL- Lustre Monitoring Tool]&lt;br /&gt;
&lt;br /&gt;
== Other Resources ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Publications Lustre Publications] - Papers and presentations about Lustre&lt;br /&gt;
* Lustre User Group&lt;br /&gt;
** [https://www.regonline.com/builder/site/Default.aspx?eventid=181696 Lustre User Group 2008]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_07 Lustre User Group 2007]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_06 Lustre User Group 2006]&lt;br /&gt;
** LUG Requirements Forum - [http://wiki.lustre.org/images/7/78/LUG-Requirements-060420-final.pdf LUG-Requirements-060420-final.pdf] | [http://wiki.lustre.org/images/7/78/LUG-Requirements-060420-final.xls LUG-Requirements-060420-final.xls]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4212</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4212"/>
		<updated>2008-01-28T03:33:33Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation at a gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre 1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** PowerPoint slides giving an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL at CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:] This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including a novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan, John Shalf (NERSC) at CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory session at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
=== Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute (HPC) Systems For Mask Data Preparation Software (CATS) ===&lt;br /&gt;
&lt;br /&gt;
* Glenn Newell, Sr. IT Solutions Mgr&lt;br /&gt;
&lt;br /&gt;
* Naji Bekhazi, Director of R&amp;amp;D, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* Ray Morgan, Sr. Product Marketing Manager, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* 2007&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** Proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution (2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have since been addressed.&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site] (the same paper as the LCI attachment above)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies (May 2006)&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4211</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4211"/>
		<updated>2008-01-28T03:33:10Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute(HPC)Systems For Mask Data Preparation Software (CATS) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation at a gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre 1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** PowerPoint slides giving an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL at CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:] This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including a novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan, John Shalf (NERSC) at CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory session at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
=== Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute (HPC) Systems For Mask Data Preparation Software (CATS) ===&lt;br /&gt;
&lt;br /&gt;
* Glenn Newell, Sr. IT Solutions Mgr&lt;br /&gt;
&lt;br /&gt;
* Naji Bekhazi, Director of R&amp;amp;D, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* Ray Morgan, Sr. Product Marketing Manager, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* 2007&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** Proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution (2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have since been addressed.&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site] (the same paper as the LCI attachment above)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies (May 2006)&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4210</id>
		<title>Lustre Publications</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Lustre_Publications&amp;diff=4210"/>
		<updated>2008-01-28T03:31:49Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: /* Synopsys */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== CFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a3/Gelato-2004-05.pdf &#039;&#039;&#039;Lustre state and production installations&#039;&#039;&#039;]&lt;br /&gt;
** Presentation at a gelato.org meeting&lt;br /&gt;
** May 2004 &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/e/ea/Lustre-usg-2003.pdf &#039;&#039;&#039;Lustre File System &#039;&#039;&#039;]&lt;br /&gt;
** A presentation on the state of Lustre in mid-2003 and the path towards Lustre 1.0.&lt;br /&gt;
** Summer, 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/d/d2/Ols2003.pdf  &#039;&#039;&#039;Lustre: Building a cluster file system for 1,000 node clusters&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation about our successes and mistakes during 2002-2003.&lt;br /&gt;
** Summer 2003&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/6/6f/T10-062002.pdf &#039;&#039;&#039;Lustre: Scalable Clustered Object Storage&#039;&#039;&#039;]&lt;br /&gt;
** A technical presentation on Lustre.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/b5/001_lustretechnical-fall2002.pdf &#039;&#039;&#039;Lustre - the inter-galactic cluster file system?&#039;&#039;&#039;]&lt;br /&gt;
** A technical overview of Lustre from 2002.&lt;br /&gt;
** June 2002&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/7/79/Intragalactic-2001.pdf &#039;&#039;&#039;Lustre Light: a simpler fully functional cluster file system&#039;&#039;&#039;]&lt;br /&gt;
** September, 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/c/c9/LustreSystemAnatomy.pdf &#039;&#039;&#039;Lustre System Anatomy&#039;&#039;&#039;]&lt;br /&gt;
** Lustre component overview.&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/af/Intergalactic-062001.pdf &#039;&#039;&#039;Lustre: the intergalactic file system for the international labs?&#039;&#039;&#039;]&lt;br /&gt;
** Presentation for Linux World and elsewhere on Lustre and Next Generation Data Centers&lt;br /&gt;
** June 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/4/44/Obdcluster.pdf &#039;&#039;&#039;The object based storage cluster file systems and parallel I/O&#039;&#039;&#039;]&lt;br /&gt;
** Sandia presentation on Lustre and Linux clustering&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/a/a2/Sdi-clusters.pdf &#039;&#039;&#039;Linux clustering and storage management&#039;&#039;&#039;]&lt;br /&gt;
** PowerPoint slides giving an overview of cluster and OBD technology&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/8/81/Lustre-sow-dist.pdf &#039;&#039;&#039;Lustre Technical Project Summary&#039;&#039;&#039;]&lt;br /&gt;
** A Lustre roadmap presented to address the [http://wiki.lustre.org/images/7/70/SGSRFP.pdf Tri-Labs/DOD SGS File System RFP].&lt;br /&gt;
** July 2001&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/b/bd/Dfsprotocols.pdf &#039;&#039;&#039;File Systems for Clusters from a Protocol Perspective&#039;&#039;&#039;] &lt;br /&gt;
** A comparative description of several distributed file systems.&lt;br /&gt;
** Proc. Second Extreme Linux Topics Workshop, Monterey CA, Jun. 1999.&lt;br /&gt;
&lt;br /&gt;
* [http://www.pdl.cs.cmu.edu/NASD &#039;&#039;&#039;CMU NASD project&#039;&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/images/2/24/Osd-r03.pdf &#039;&#039;&#039;Working draft T10 OSD&#039;&#039;&#039;]&lt;br /&gt;
** A standards effort exists in the T10 OSD working group proposal.&lt;br /&gt;
** October 2000&lt;br /&gt;
&lt;br /&gt;
== Cray User Group ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Guidelines for Efficient Parallel I/O on the Cray XT3/XT4 &#039;&#039;&#039;&lt;br /&gt;
** Jeff Larkin, Mark Fahey, proceedings of CUG2007&lt;br /&gt;
** [http://wiki.lustre.org/images/3/3f/Larkin_paper.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XT7? Integrating and Operating a Conjoined XT3+XT4 System&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/b/b9/Canon_slides.pdf Presentation:] Presented by ORNL at CUG 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fa/Canon_paper.pdf Paper:] This paper describes the processes and tools used to move production work from the pre-existing XT3 to the new system incorporating that same XT3, including a novel application of Lustre routing capabilities.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Using IOR to Analyze the I/O Performance&#039;&#039;&#039;&lt;br /&gt;
** Presented by Hongzhang Shan, John Shalf (NERSC) at CUG 2007&lt;br /&gt;
**[http://wiki.lustre.org/images/e/ef/Using_IOR_to_Analyze_IO_Performance.pdf Slides in PDF format]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;A Center-Wide File System using Lustre&#039;&#039;&#039;&lt;br /&gt;
** Shane Canon, H. Sarp Oral, proceedings of CUG2006&lt;br /&gt;
** [http://wiki.lustre.org/images/7/77/A_Center-Wide_FS_using_Lustre.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== HEPiX Talks == &lt;br /&gt;
* [https://indico.desy.de/conferenceTimeTable.py?confId=257&amp;amp;showDate=all&amp;amp;showSession=all&amp;amp;detailLevel=contribution&amp;amp;viewMode=plain Spring HEPiX 2007]: April 23-27, 2007&lt;br /&gt;
* &#039;&#039;&#039;Storage Evaluations at BNL&#039;&#039;&#039;&lt;br /&gt;
** Presented by Robert Petkus - BNL&lt;br /&gt;
** Performance comparison between ZFS, XFS and EXT3 on a Sun Thumper&lt;br /&gt;
** [http://wiki.lustre.org/images/d/da/Storage_Evaluations%40BNL.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=26&amp;amp;amp;sessionId=40&amp;amp;amp;resId=1&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Storage Evaluations at BNL]&lt;br /&gt;
&lt;br /&gt;
*  &#039;&#039;&#039;Lustre Experience at CEA/DIF&#039;&#039;&#039;&lt;br /&gt;
** Presented by J-Ch Lafoucriere &lt;br /&gt;
** [http://wiki.lustre.org/images/5/58/DIF.pdf Slides in PDF format]&lt;br /&gt;
** Slides on HEPiX site: [https://indico.desy.de/getFile.py/access?contribId=44&amp;amp;amp;sessionId=39&amp;amp;amp;resId=0&amp;amp;amp;materialId=slides&amp;amp;amp;confId=257 Lustre Experience at CEA/DIF]&lt;br /&gt;
&lt;br /&gt;
== Indiana University ==&lt;br /&gt;
* &#039;&#039;&#039;Wide Area Filesystem Performance using Lustre on the TeraGrid&#039;&#039;&#039;&lt;br /&gt;
** TeraGrid 2007 conference, June 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/2/20/Lustre_wan_tg07.pdf Paper in PDF format]&lt;br /&gt;
&lt;br /&gt;
== Karlsruhe Lustre Talks ==&lt;br /&gt;
&lt;br /&gt;
* http://www.rz.uni-karlsruhe.de/dienste/lustretalks.php&lt;br /&gt;
* &#039;&#039;&#039;Filesystems on SSCK&#039;s HP XC6000&#039;&#039;&#039;&lt;br /&gt;
** Introductory session at the computing center (2005): [http://wiki.lustre.org/images/7/7c/Karlsruhe0503.pdf Karlsruhe0503.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences &amp;amp; Performance of SFS/Lustre Cluster File System in Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 4 in Krakau (10.5.2005): [http://wiki.lustre.org/images/9/95/Karlsruhe0510.pdf Karlsruhe0510.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** ISC 2005 in Heidelberg (24.6.2005): [http://wiki.lustre.org/images/5/5f/Karlsruhe0506.pdf Karlsruhe0506.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with 10 Months HP SFS/Lustre in HPC Production&#039;&#039;&#039;&lt;br /&gt;
** HP-CAST 5 in Seattle (11.11.2005):  [http://wiki.lustre.org/images/1/17/Karlsruhe0511.pdf Karlsruhe0511.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Performance Monitoring in a HP SFS Environment&#039;&#039;&#039;&lt;br /&gt;
** HP-CCN in Seattle (12.11.2005): [http://wiki.lustre.org/images/a/aa/Karlsruhe0512.pdf Karlsruhe0512.pdf]&lt;br /&gt;
* &#039;&#039;&#039;Experiences with HP SFS/Lustre at SSCK&#039;&#039;&#039;&lt;br /&gt;
** SGPFS 5 in Stuttgart (4.4.2006): [http://wiki.lustre.org/images/0/0b/Karlsruhe0604.pdf Karlsruhe0604.pdf]&lt;br /&gt;
&lt;br /&gt;
== Ohio State University == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Benefits of High Speed Interconnects to Cluster File Systems: A Case Study with Lustre&#039;&#039;&#039;&lt;br /&gt;
** Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. &lt;br /&gt;
** Lustre performance comparison when using InfiniBand and Quadrics interconnects&lt;br /&gt;
** [http://wiki.lustre.org/images/d/d8/Cac06_lustre.pdf Paper in PDF format]&lt;br /&gt;
** [http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/yu-cac06.pdf Download paper at OSU site]&lt;br /&gt;
&lt;br /&gt;
== ORNL == &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Exploiting Lustre File Joining for Effective Collective IO&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/d/db/Yu_lustre.pdf Paper in pdf format]&lt;br /&gt;
** Proceedings of the CCGrid&#039;07, May 2007.&lt;br /&gt;
&lt;br /&gt;
== SUN == &lt;br /&gt;
* &#039;&#039;&#039;Tokyo Tech Tsubame Grid Storage Implementation&#039;&#039;&#039;&lt;br /&gt;
** By Syuuichi Ihara, May 2007&lt;br /&gt;
** [http://wiki.lustre.org/images/7/79/Thumper-BP-6.pdf Paper in pdf format]&lt;br /&gt;
** [http://www.sun.com/blueprints/0507/820-2187.html Sun BluePrints Publications]&lt;br /&gt;
&lt;br /&gt;
== Synopsys ==&lt;br /&gt;
&lt;br /&gt;
=== Optimizing Storage and I/O For Distributed Processing On Enterprise &amp;amp; High Performance Compute (HPC) Systems For Mask Data Preparation Software (CATS) ===&lt;br /&gt;
&lt;br /&gt;
* Glenn Newell, Sr. IT Solutions Mgr&lt;br /&gt;
&lt;br /&gt;
* Naji Bekhazi, Director of R&amp;amp;D, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* Ray Morgan, Sr. Product Marketing Manager, Mask Data Prep (CATS)&lt;br /&gt;
&lt;br /&gt;
* 2007&lt;br /&gt;
&lt;br /&gt;
* http://wiki.lustre.org/index.php?title=Image:Hpc_cats_wp.pdf&lt;br /&gt;
&lt;br /&gt;
== University of Colorado, Boulder ==&lt;br /&gt;
* &#039;&#039;&#039;Shared Parallel Filesystem in Heterogeneous Linux Multi-Cluster Environment&#039;&#039;&#039;&lt;br /&gt;
** [http://wiki.lustre.org/images/8/81/LciPaper.pdf Paper in PDF format]&lt;br /&gt;
** Proceedings of the 6th LCI International Conference on Linux Clusters: The HPC Revolution (2005)&lt;br /&gt;
** The management issues mentioned in the last part of this paper have since been addressed.&lt;br /&gt;
** [http://linuxclustersinstitute.org/Linux-HPC-Revolution/Archive/PDF05/17-Oberg_M.pdf Paper at CU site] (the same paper as the LCI attachment above)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== University of Minnesota ==&lt;br /&gt;
* &#039;&#039;&#039;Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems&#039;&#039;&#039;&lt;br /&gt;
** MSST2006, Conference on Mass Storage Systems and Technologies (May 2006)&lt;br /&gt;
** [http://wiki.lustre.org/images/f/fc/MSST-2006-paper.pdf Paper in PDF format]&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=File:Hpc_cats_wp.pdf&amp;diff=4209</id>
		<title>File:Hpc cats wp.pdf</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=File:Hpc_cats_wp.pdf&amp;diff=4209"/>
		<updated>2008-01-28T03:30:53Z</updated>

		<summary type="html">&lt;p&gt;Lollsolo: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Lollsolo</name></author>
	</entry>
</feed>