WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
Running Hadoop with Lustre: Difference between revisions
From Obsolete Lustre Wiki
				
				
				Jump to navigationJump to search
				
				| Line 21: | Line 21: | ||
| * Using Hadoop is time-consuming for small files. | * Using Hadoop is time-consuming for small files. | ||
| == Comparing the Performance of Lustre  | == Comparing the Performance of Lustre and HDFS == | ||
| The paper [http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf'' Using Lustre with Apache Hadoop''] provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests. | The paper [http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf'' Using Lustre with Apache Hadoop''] provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests. | ||
Revision as of 13:28, 5 August 2009
This page describes how Apache Hadoop performs with the Lustre file system when the Hadoop Distributed File System (HDFS) is replaced by Lustre.
Advantages of Using Hadoop with Lustre
Using Hadoop with Lustre offers several advantages over HDFS. We have made several enhancements to improve the use of Hadoop with Lustre. Advantages include:
- Lustre is a real parallel file system, which enables temporary or intermediate data to be stored parallel in multinode (in parallel on multiple nodes?), alleviating the load of a single node (reducing the load on single nodes?).
- Lustre has its own network protocol, which is better for bulk data transfer compared to the HTTP protocol. Additionally, as a real shared file system, each client sees the same file system image, so hardlink (hardlinks?) can be used to avoid data transfer between nodes.
- Lustre is more easily? extended and can be mounted as a normal POSIX file system.
Disadvantages of Using Hadoop with HDFS
- Hadoop sometimes generates a large amount of temporary or intermediate data during the Map/Reduce process. HDFS stores these files on the local disk, which results in a considerable load on the OS/disk (OS and disk I/O?).
- During the Map/Reduce process, the Reduce node uses the HTTP protocol to retrieve Map results from the Map node protocol. The HTTP protocol is not a good choice for large data transfers because?...
- Hadoop is designed for Map/Reduce jobs, which makes it difficult to extend HDFS? as a normal file system.
- Using Hadoop is time-consuming for small files.
Comparing the Performance of Lustre and HDFS
The paper Using Lustre with Apache Hadoop provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests.

