WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Running Hadoop with Lustre: Difference between revisions


Revision as of 11:45, 5 August 2009

This page describes how Hadoop performs with the Lustre file system when the Hadoop Distributed File System (HDFS) is replaced by Lustre.

Advantages of Using Hadoop with Lustre

Using Hadoop with Lustre offers several advantages over HDFS. We have made several enhancements to improve the use of Hadoop with Lustre. Advantages include:

  • Lustre is a true parallel file system, so temporary or intermediate data can be stored in parallel across multiple nodes, relieving the load on any single node.
  • Lustre has its own network protocol, which is better suited to bulk data transfer than HTTP. Additionally, because Lustre is a true shared file system, every client sees the same file system image, so hard links can be used to avoid transferring data between nodes.
  • Lustre is more easily extended and can be mounted as a normal POSIX file system.
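
The hard-link idea can be sketched as follows. This is only an illustration, not the integration code this page refers to: the function name and the directory layout are hypothetical, and it assumes that the Map output and the Reduce input directory live on the same shared POSIX mount (for example, a Lustre client mount), so that a link created on one node is visible from every other node.

```python
import os


def link_map_output(map_output_path, reduce_input_path):
    """Expose a finished Map output file to a Reduce task via a hard link.

    Hypothetical helper: both paths are assumed to be on the same shared
    file system, so no file data is copied or sent over the network.
    """
    parent = os.path.dirname(reduce_input_path)
    if parent:
        # Create the Reduce task's input directory if it does not exist yet.
        os.makedirs(parent, exist_ok=True)
    # os.link creates a second directory entry for the same file.
    os.link(map_output_path, reduce_input_path)
    return reduce_input_path
```

Because both names refer to the same underlying file, the Reduce task reads the Map output directly from the shared file system rather than fetching it from the Map node.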

Disadvantages of Using Hadoop with HDFS

  • Hadoop sometimes generates a large amount of temporary or intermediate data during the Map/Reduce process. HDFS stores these files on the local disk, which results in a considerable load on the OS/disk.
  • During the Map/Reduce process, the Reduce node uses HTTP to retrieve Map results from the Map node. HTTP is not a good choice for large data transfers.
  • Hadoop is designed for Map/Reduce jobs, which makes it difficult to extend as a normal file system.
  • Processing small files with Hadoop is time-consuming.

Test Comparisons Between Lustre and HDFS

This paper provides suggestions about how to set up Lustre with Hadoop and how to use stripe information to help Hadoop schedule the job.
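
As a rough illustration of how stripe information could feed into scheduling, the sketch below groups a file's byte ranges by stripe index. It assumes a plain round-robin striped layout, where consecutive chunks of stripe_size bytes rotate across stripe_count objects; the function names are hypothetical, and the stripe size and count would have to be obtained separately (for example, from lfs getstripe). It does not describe the scheduling method used in the paper.

```python
def stripe_index(offset, stripe_size, stripe_count):
    """Return which stripe (0 .. stripe_count - 1) holds the given byte offset.

    Assumes a plain round-robin layout: chunk k of the file is stored in
    stripe k % stripe_count.
    """
    return (offset // stripe_size) % stripe_count


def splits_by_stripe(file_size, stripe_size, stripe_count):
    """Group stripe-aligned (offset, length) ranges of a file by stripe index."""
    groups = {i: [] for i in range(stripe_count)}
    offset = 0
    while offset < file_size:
        # The last chunk may be shorter than a full stripe.
        length = min(stripe_size, file_size - offset)
        index = stripe_index(offset, stripe_size, stripe_count)
        groups[index].append((offset, length))
        offset += length
    return groups
```

A scheduler could use such a grouping to hand each task input ranges that sit on the same storage target, or to spread concurrent tasks across different targets.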

Using Lustre with Hadoop