WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Architecture - MPI IO and NetCDF
Latest revision as of 14:17, 22 January 2010
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Definitions
ADIO - Abstract Device Interface for parallel I/O. Here it refers specifically to the ADIO driver inside MPI.
HDF5 - Hierarchical Data Format (HDF5) is a set of software libraries and a machine-independent standard for storing scientific data (metadata and array data) in files. Here it refers specifically to the HDF5 library.
NetCDF - NetCDF (network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is similar to HDF5, but does not support parallel I/O.
Background
Good parallel I/O performance on Lustre depends not only on a proper I/O pattern in the application, but also on good behavior from the underlying filesystem and its ADIO driver. This page discusses the topic in three areas.
- Lustre ADIO driver improvements.
- Lustre filesystem internal improvements for parallel I/O.
- HDF5 and NetCDF with Lustre.
Lustre ADIO driver improvements
Use cases
| ADIO collective Open | performance | Processes from different clients open the shared file at the same time. | 
| ADIO collective Open/Create | performance | Processes from different clients each open/create their own files under the same directory at the same time. | 
| ADIO collective I/O | performance | Processes from different clients read/write the shared file at the same time | 
| ADIO preallocation | usability | Preallocate space for the application at user level. | 
ADIO collective open
| Scenario: | MPI collective open calls this collective open API to open the file. | 
| Business Goals: | performance | 
| Relevant QA's: | performance | 
| Environment: | MPI environment | 
| Implementation: | One client performs the lookup and open, retrieves the Lustre open handle (parent fid, child fid) for the object, and distributes this handle to the other clients. Each of the other clients, on receiving the handle, calls an ioctl (or open with a special flag) with the handle and goes to the MDS to open by handle directly, without a lookup. The child fid is also checked for correctness. | 
| Implementation constraints: | Collective open should be used only when lookup accounts for a large share of the whole open process (for example, > 30%). Collective open should support recovery. Collective open semantics should not be changed. | 
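The open-by-handle flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `FakeMDS`, `OpenHandle`, and `collective_open` are hypothetical names invented for illustration, not Lustre or MPI APIs, and the broadcast that a real driver would perform (for example with `MPI_Bcast`) is modeled as a plain loop over ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenHandle:
    """Hypothetical stand-in for the Lustre open handle (parent fid, child fid)."""
    parent_fid: int
    child_fid: int


@dataclass
class FakeMDS:
    """Toy metadata server that counts lookups and opens; not a Lustre API."""
    namespace: dict  # path -> (parent_fid, child_fid)
    lookups: int = 0
    opens: int = 0

    def lookup_and_open(self, path):
        # Regular open: a name lookup followed by the open itself.
        self.lookups += 1
        self.opens += 1
        parent_fid, child_fid = self.namespace[path]
        return OpenHandle(parent_fid, child_fid)

    def open_by_handle(self, handle):
        # Open directly by fid without a lookup, but verify the child fid.
        self.opens += 1
        if (handle.parent_fid, handle.child_fid) not in self.namespace.values():
            raise FileNotFoundError("stale or invalid handle")
        return handle.child_fid


def collective_open(mds, path, nprocs):
    """Rank 0 does lookup+open; the other ranks open by the shared handle."""
    handle = mds.lookup_and_open(path)          # rank 0 only
    opened = [handle.child_fid]
    for _rank in range(1, nprocs):              # models the broadcast of the handle
        opened.append(mds.open_by_handle(handle))
    return opened
```

In this model, opening one shared file from eight ranks costs a single lookup instead of eight, which is the saving the collective open is meant to provide when lookup dominates the open cost.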
ADIO collective open/create
| Scenario: | MPI collective open/create will use this collective open API. | 
| Business Goals: | performance | 
| Relevant QA's: | performance | 
| Environment: | MPI environment | 
| Implementation: | One client (or several chosen clients) gathers the create information (name, dir) from the other clients, then sends the collective create request to the MDS. The MDS might need a new handler for the collective create request. The created handles (fid, name, parent_handle) are then distributed to the other clients, which go to the MDS with the handle and open it. | 
| Implementation constraints: | Collective open/create will be chosen only when a large number of processes (> 500) create files under the same directory at the same time. Collective open/create must support recovery. MPI collective open/create semantics must remain unchanged. | 
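A minimal sketch of the gather, batched create, and open-by-handle steps follows. `BatchMDS`, `create_batch`, and `collective_create` are hypothetical names used only for illustration; the handle layout (fid, name, parent) and the 500-process threshold come from the table above, while everything else is an assumption. For simplicity every rank, including the aggregator, opens its file by handle.

```python
import itertools

COLLECTIVE_THRESHOLD = 500  # from the constraint above: use only for > 500 processes


class BatchMDS:
    """Toy metadata server with a batched create handler; not a Lustre API."""

    def __init__(self):
        self.requests = 0                       # RPCs received
        self._next_fid = itertools.count(100)
        self.files = {}                         # (parent, name) -> fid

    def create_batch(self, parent, names):
        # One request creates every file and returns one handle per file.
        self.requests += 1
        handles = []
        for name in names:
            if (parent, name) in self.files:
                raise FileExistsError(name)
            fid = next(self._next_fid)
            self.files[(parent, name)] = fid
            handles.append((fid, name, parent))
        return handles

    def open_by_handle(self, handle):
        # Open by fid without a lookup, verifying the handle first.
        self.requests += 1
        fid, name, parent = handle
        if self.files.get((parent, name)) != fid:
            raise FileNotFoundError(name)
        return fid


def use_collective_create(nprocs):
    """Apply the > 500 processes rule from the constraints."""
    return nprocs > COLLECTIVE_THRESHOLD


def collective_create(mds, parent, names_by_rank):
    """Aggregator gathers names, sends one create, then each rank opens its handle."""
    handles = mds.create_batch(parent, names_by_rank)   # gather + single create RPC
    return [mds.open_by_handle(h) for h in handles]     # scatter + per-rank open
```

The point of the batching is that the per-file name handling is done once in a single MDS request, leaving each client with only an open by handle.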
ADIO collective I/O
| Scenario: | MPI collective read/write API | 
| Business Goals: | Performance, and finding the right parallel I/O hints for Lustre. | 
| Relevant QA's: | performance | 
| Environment: | MPI environment | 
| Implementation: | Adjust the I/O segment size (ind_wr_buffer_size/ind_rd_buffer_size, cb_buffer_size) automatically for some applications. Avoid "flock" in the read-modify-write phase of collective I/O. | 
| Implementation constraints: | Collective read/write semantics should not change. | 
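One way the buffer sizes named above might be chosen automatically is to align them to the file's stripe size, so that a buffer never straddles a stripe boundary. The sketch below is an assumed heuristic, not the driver's actual policy: `suggest_hints` and `round_up` are hypothetical helpers, while the hint keys are the standard ROMIO hint names. Values are returned as strings because MPI info values are strings.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def suggest_hints(stripe_size, stripe_count, typical_request):
    """Return stripe-aligned buffer hints (an assumed heuristic, for illustration).

    stripe_size      -- Lustre stripe size in bytes
    stripe_count     -- number of OSTs the file is striped over
    typical_request  -- typical per-process request size in bytes
    """
    # Collective buffer: at least one stripe, rounded up to whole stripes.
    cb_size = round_up(max(typical_request, stripe_size), stripe_size)
    return {
        "striping_unit": str(stripe_size),
        "striping_factor": str(stripe_count),
        "cb_buffer_size": str(cb_size),
        # Independent I/O buffers: one stripe each.
        "ind_rd_buffer_size": str(stripe_size),
        "ind_wr_buffer_size": str(stripe_size),
    }
```

With an MPI binding, a dictionary like this would be copied into an info object and passed to the collective open call; that step is omitted here to keep the sketch free of external dependencies.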
ADIO preallocation
| Scenario: | MPI preallocation API | 
| Business Goals: | correctness of operations. | 
| Relevant QA's: | Usability | 
| Environment: | MPI environment | 
| Implementation: | Implement the application level preallocation for LUSTRE ADIO interface by fcntl | 
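As an illustration of what application-level preallocation does, the sketch below reserves space for an open file descriptor. Note the substitution: the design above proposes going through fcntl in the Lustre ADIO interface, whereas this example uses `os.posix_fallocate` where available and falls back to appending zeros. `preallocate` and `demo` are hypothetical helpers, and a positive `size` is assumed.

```python
import os
import tempfile


def preallocate(fd, size):
    """Ensure the file behind fd is at least `size` bytes, keeping existing data."""
    if hasattr(os, "posix_fallocate"):
        try:
            os.posix_fallocate(fd, 0, size)
            return
        except OSError:
            pass  # filesystem does not support it; fall back to writing zeros
    # Fallback: append zeros after the current end of file.
    current = os.fstat(fd).st_size
    remaining = size - current
    chunk = b"\0" * (1 << 20)
    os.lseek(fd, current, os.SEEK_SET)
    while remaining > 0:
        written = os.write(fd, chunk[:min(len(chunk), remaining)])
        remaining -= written


def demo():
    """Write 3 bytes, preallocate 4096, and return (file size, first 3 bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prealloc.dat")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"abc")
            preallocate(fd, 4096)
            size = os.fstat(fd).st_size
            os.lseek(fd, 0, os.SEEK_SET)
            head = os.read(fd, 3)
        finally:
            os.close(fd)
    return size, head
```

Either path leaves existing data untouched, which matters for the correctness goal stated in the table.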
Lustre internal optimization for parallel I/O
| Scenario: | Many clients access the shared file, and each process accesses only its own extent of the file. | 
| Business Goals: | Reasonable extent lock policy for multiple processes accessing the shared file. | 
| Relevant QA's: | performance | 
| Environment: | MPI collective read/write API | 
| Implementation: | If each client knows its own access area (for example, from MPI_File_set_view), it should send some heuristic information to the server when it subsequently enqueues a lock. The server can then grant a more reasonable extent lock based on this information. | 
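The effect of such a hint can be shown with a toy lock server. This is a heavily simplified model, assuming a server that otherwise grows each granted extent as far as it can and revokes whole conflicting locks; `ExtentLockServer` and `simulate` are hypothetical names and do not reflect the real Lustre lock manager code. Extents are half-open `(start, end)` pairs in arbitrary units.

```python
INF = float("inf")


class ExtentLockServer:
    """Toy extent lock server; a simplified model, not Lustre's lock manager."""

    def __init__(self):
        self.granted = {}       # client -> (start, end)
        self.revocations = 0

    def enqueue(self, client, start, end, hint=None):
        # Revoke every other client's lock that overlaps the requested extent.
        for other, (s, e) in list(self.granted.items()):
            if other != client and s < end and start < e:
                del self.granted[other]
                self.revocations += 1
        # Largest extent around the request that avoids the remaining locks.
        lo, hi = 0, INF
        for other, (s, e) in self.granted.items():
            if other == client:
                continue
            if e <= start:
                lo = max(lo, e)
            elif s >= end:
                hi = min(hi, s)
        if hint is not None:
            # Clip the grant to the area the client says it will access.
            lo = max(lo, min(hint[0], start))
            hi = min(hi, max(hint[1], end))
        self.granted[client] = (lo, hi)
        return lo, hi


def simulate(nclients, region, chunk, use_hints):
    """Each client writes its own region in chunks; return (enqueues, revocations)."""
    server = ExtentLockServer()
    enqueues = 0
    for step in range(region // chunk):
        for c in range(nclients):
            start = c * region + step * chunk
            end = start + chunk
            held = server.granted.get(c)
            if held and held[0] <= start and end <= held[1]:
                continue  # already covered by a held lock, no enqueue needed
            hint = (c * region, (c + 1) * region) if use_hints else None
            server.enqueue(c, start, end, hint)
            enqueues += 1
    return enqueues, server.revocations
```

In this model, four clients writing disjoint regions keep revoking each other's over-wide locks without hints, while with hints each client enqueues once and no lock is revoked.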
HDF5 and NetCDF with Lustre
| Scenario: | Many scientific applications use the HDF5 and NetCDF libraries to store their data and related metadata. | 
| Business Goals: | Improve Lustre performance with these libraries. | 
| Relevant QA's: | performance | 
| Environment: | Scientific application | 
| Implementation: | Run IOR (HDF5 mode) and the scientific applications, collect profiling information under different configurations, and derive the right rules for using the libraries on Lustre. Both parallel NetCDF and HDF5 implement their parallel I/O on top of the MPI parallel I/O API, so the Lustre ADIO driver will also be checked for compatibility with these libraries. | 

