Architecture - MPI IO and NetCDF

Definitions

ADIO - Abstract Device Interface for I/O, a portable abstraction layer for parallel I/O. Here it refers specifically to the ADIO driver inside MPI.
HDF5 - Hierarchical Data Format (HDF5) is a set of software libraries and a machine-independent standard for storing scientific data (metadata and array data) in files. Here it refers specifically to the HDF5 library.
NetCDF - NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is similar to HDF5, but does not itself support parallel I/O.

Background

Good parallel I/O performance on Lustre depends not only on a proper I/O pattern in the application, but also on the behavior of the underlying filesystem and its ADIO driver. This page discusses the topic in three areas:

  1. Lustre ADIO driver improvements.
  2. Lustre filesystem internal improvements for parallel I/O.
  3. HDF5 and NetCDF with Lustre.

Lustre ADIO driver improvements

Use cases

ADIO collective open performance - processes on different clients open the same shared file at the same time.
ADIO collective open/create performance - processes on different clients each open/create their own file under the same directory at the same time.
ADIO collective I/O performance - processes on different clients read/write the shared file at the same time.
ADIO preallocation usability - preallocate space for the application at the user level.

ADIO collective open

Scenario: MPI collective open calls this collective open API to open the file.
Business Goals: Performance
Relevant QA's: Performance
Environment: MPI environment
Implementation: One client will do the lookup and open, retrieve the Lustre open handle (parent FID, child FID) of the object, and distribute this handle to the other clients. The other clients, having received the handle, will call an ioctl (or open with a special flag) with the handle and go to the MDS to open the handle directly (without a lookup), also checking that the child FID is correct.
Implementation constraints: Collective open should only be used when the lookup accounts for a large share of the whole open process (for example > 30%). "Collective open" must support recovery, and the collective open semantics must not be changed.
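The handle-distribution step can be sketched with a plain MPI broadcast. In the sketch below, only the MPI calls are real API; the handle size, the handle-retrieving ioctl, and the open-by-handle path are hypothetical placeholders for the mechanism described above.

  /* Sketch of collective open: rank 0 pays the lookup once, then
   * broadcasts an opaque open handle to the other ranks.
   * LL_IOC_GET_OPEN_HANDLE and open_by_handle() are hypothetical. */
  #include <mpi.h>
  #include <fcntl.h>

  #define HANDLE_BYTES 64                 /* assumed size of the opaque handle */

  int collective_open(const char *path, MPI_Comm comm)
  {
      int rank, fd = -1;
      unsigned char handle[HANDLE_BYTES] = {0};

      MPI_Comm_rank(comm, &rank);

      if (rank == 0) {
          /* Rank 0 does the name lookup and open on the MDS. */
          fd = open(path, O_RDWR);
          /* ioctl(fd, LL_IOC_GET_OPEN_HANDLE, handle);   -- hypothetical */
      }

      /* Distribute the opaque handle (parent FID, child FID) to all ranks. */
      MPI_Bcast(handle, HANDLE_BYTES, MPI_BYTE, 0, comm);

      if (rank != 0) {
          /* fd = open_by_handle(path, handle);            -- hypothetical,
           * would skip the MDS name lookup entirely. */
          fd = open(path, O_RDWR);        /* fallback: plain open */
      }
      return fd;
  }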

ADIO collective open/create

Scenario: MPI collective open/create calls this collective open/create API.
Business Goals: Performance
Relevant QA's: Performance
Environment: MPI environment
Implementation: One client (or several chosen clients) will gather the create information (name, directory) from the other clients, then send a "collective" create request to the MDS, which may need a new handler for this kind of request. The created handles (FID, name, parent handle) are then distributed back to the other clients, which go to the MDS with their handle and open the file.
Implementation constraints: "Collective open/create" will be chosen only when many processes (for example > 500) create files under the same directory at the same time. "Collective open/create" must support recovery and keep the MPI collective open/create semantics unchanged.
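The gathering step can be expressed with a single MPI_Gather, as in the minimal sketch below. The fixed name length and file-name pattern are assumptions, and the batched MDS create request itself is hypothetical and only indicated by a comment.

  /* Sketch: rank 0 collects the per-rank create requests in one gather. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define NAME_LEN 256

  void collective_create(const char *dir, MPI_Comm comm)
  {
      int rank, size;
      char name[NAME_LEN];
      char *all_names = NULL;

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      /* Each rank prepares the name of the file it wants created. */
      snprintf(name, NAME_LEN, "%s/output.%06d", dir, rank);

      if (rank == 0)
          all_names = malloc((size_t)size * NAME_LEN);

      /* Rank 0 collects all names with a single gather ... */
      MPI_Gather(name, NAME_LEN, MPI_CHAR,
                 all_names, NAME_LEN, MPI_CHAR, 0, comm);

      if (rank == 0) {
          /* ... and would send one batched create request to the MDS here
           * (hypothetical), then distribute the returned handles back. */
          free(all_names);
      }
  }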

ADIO collective I/O

Scenario: MPI collective read/write API
Business Goals: Performance, and finding the right parallel I/O hints for Lustre.
Relevant QA's: performance
Environment: MPI environment
Implementation: Adjust the I/O segment size (ind_wr/rd_buffer_size, cb_buffer_size) automatically for some applications, and avoid "flock" in the read-modify-write phase of collective I/O.
Implementation constraints: Collective read/write semantics must not change.
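The buffer-size hints named above are standard ROMIO hints and can be passed through an MPI_Info object when the file is opened. The example below shows the mechanism only; the file path and hint values are placeholders to be tuned, not recommendations.

  /* Passing collective-buffering and independent I/O buffer-size hints. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Info info;

      MPI_Init(&argc, &argv);
      MPI_Info_create(&info);

      /* ROMIO hints mentioned in the implementation notes above. */
      MPI_Info_set(info, "cb_buffer_size",     "16777216");
      MPI_Info_set(info, "ind_rd_buffer_size", "4194304");
      MPI_Info_set(info, "ind_wr_buffer_size", "4194304");

      MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/shared_file",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

      MPI_File_close(&fh);
      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }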

ADIO preallocation

Scenario: MPI preallocation API
Business Goals: Correctness of operations.
Relevant QA's: Usability
Environment: MPI environment
Implementation: Implement application-level preallocation for the Lustre ADIO interface using fcntl.
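From the application side, this path is exercised through the standard MPI_File_preallocate call, as in the minimal example below; the file path and size are placeholders.

  /* Collectively reserving space before the ranks start writing. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;

      MPI_Init(&argc, &argv);
      MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/prealloc_test",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

      /* Preallocate 1 GiB; the ADIO driver maps this to the filesystem. */
      MPI_File_preallocate(fh, (MPI_Offset)1 << 30);

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }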

Lustre internal Optimization for parallel I/O

Scenario: Many clients access a shared file, and each process only accesses its own extent of the file.
Business Goals: A reasonable extent lock policy for multiple processes accessing a shared file.
Relevant QA's: performance
Environment: MPI collective read/write API
Implementation: If each client knows its own access area (for example, via MPI_File_set_view), it should send this heuristic information to the server when it enqueues the lock request, so the server can grant a more reasonable extent lock based on this information.
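The example below shows each rank declaring its own disjoint extent of the shared file with MPI_File_set_view, which is exactly the access information the text proposes forwarding to the lock server. The file path and chunk size are placeholders.

  /* Each rank restricts its view to a disjoint byte range of the file. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      int rank;
      const MPI_Offset chunk = 1 << 20;          /* 1 MiB per rank (placeholder) */
      char *buf = calloc(1, (size_t)chunk);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/shared_file",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

      /* The view tells MPI-IO (and potentially the lock server) that this
       * rank only touches [rank * chunk, (rank + 1) * chunk). */
      MPI_File_set_view(fh, rank * chunk, MPI_BYTE, MPI_BYTE,
                        "native", MPI_INFO_NULL);

      MPI_File_write_all(fh, buf, (int)chunk, MPI_BYTE, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      free(buf);
      MPI_Finalize();
      return 0;
  }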

HDF5 and NetCDF with Lustre

Scenario: Many scientific applications use the HDF5 and NetCDF libraries to store their data and related metadata.
Business Goals: Improve Lustre performance with these libraries.
Relevant QA's: performance
Environment: Scientific application
Implementation: Run IOR (HDF5 mode) and those scientific applications, gather profiling information under different configurations, and derive the right rules for using these libraries with Lustre. Both Parallel NetCDF and HDF5 implement their parallel I/O on top of the MPI parallel I/O API, so the Lustre ADIO driver will also be checked for compatibility with these libraries.
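Both libraries reach the Lustre ADIO driver through MPI-IO. For HDF5 this happens when the file is opened with the MPI-IO file driver, as in the sketch below; the file name is a placeholder.

  /* Opening an HDF5 file through the MPI-IO driver (and thus through
   * the Lustre ADIO driver underneath). */
  #include <mpi.h>
  #include <hdf5.h>

  int main(int argc, char **argv)
  {
      hid_t fapl, file;

      MPI_Init(&argc, &argv);

      /* File-access property list that routes HDF5 I/O through MPI-IO. */
      fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

      file = H5Fcreate("/mnt/lustre/data.h5", H5F_ACC_TRUNC,
                       H5P_DEFAULT, fapl);

      H5Fclose(file);
      H5Pclose(fapl);
      MPI_Finalize();
      return 0;
  }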