WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

Difference between revisions of "Running Hadoop with Lustre"

From Obsolete Lustre Wiki
Jump to navigationJump to search
 
(12 intermediate revisions by one other user not shown)
Line 1: Line 1:
This page provides access to information about how to integrate Apache Hadoop with Lustre. We have made several enhancements to improve the use of Hadoop with Lustre and have conducted performance tests to compare the performance of Lustre vs. HDFS when used with Hadoop.
+
<small>''(Updated: Jan 2010)''</small>
  
== Advantages of Using Hadoop with Lustre ==
+
This project was conducted by three interns as a proof of concept study for how to integrate Apache Hadoop with Lustre™. Although the performance tests, conducted by the interns to compare Lustre to HDFS when used with Hadoop, did not show a significant performance benefit for Lustre, this study has laid the groundwork for further work in this area.
  
Using Hadoop with Lustre provides several advantages including:
+
== Benefits of Using Hadoop with Lustre ==
  
* Lustre is a real parallel file system, which enables temporary or intermediate data to be stored [[parallel in multinode]] '''[[(in parallel on multiple nodes?)]]''', [[alleviating the load of a single node]] '''[[(reducing the load on single nodes?)]]'''.
+
Using Hadoop with Lustre provides several benefits, including:
  
* Lustre has its own network protocol, which is [[better]] '''[[(more efficient?)]]''' for bulk data transfer than the HTTP protocol. Additionally, because Lustre is a shared file system, each client sees the same file system image, so [[hardlink]] '''[[(hardlinks?)]]''' can be used to avoid data transfer between nodes.
+
* Lustre is a real parallel file system, which enables temporary or intermediate data to be stored in parallel on multiple nodes reducing the load on single nodes.  
  
* Lustre is more '''[[easily?]]''' extended and can be mounted as a normal POSIX file system.
+
* Lustre has its own network protocol, which is more efficient for bulk data transfer than the HTTP protocol. Additionally, because Lustre is a shared file system, each client sees the same file system image, so hardlinks can be used to avoid data transfer between nodes.
  
== Disadvantages of Using Hadoop with HDFS ==
+
* Lustre can be mounted as a standard POSIX file system.
 +
 
 +
== Drawbacks of Using Hadoop with HDFS ==
  
 
Using Hadoop with HDFS has several drawbacks, including:
 
Using Hadoop with HDFS has several drawbacks, including:
  
* Hadoop sometimes generates a large amount of temporary or intermediate data during the Map/Reduce process. HDFS stores these files on the local disk, which results in a considerable load on [[the OS/disk]] '''[[(OS and disk I/O?)]]'''.
+
* Hadoop sometimes generates a large amount of temporary or intermediate data during the Map/Reduce process. HDFS stores these files on the local disk, which results in a considerable load on OS and disk I/O.
  
* During the Map/Reduce process, the Reduce node uses the HTTP protocol to retrieve Map results from the Map node protocol. The HTTP protocol is [[not a good choice]] for large data transfers '''[[because?...]]'''
+
* During the Map/Reduce process, the Reduce node uses the HTTP protocol to retrieve Map results from the Map node protocol. The HTTP protocol is not a good choice for large data transfers because it does not support the RDMA protocol, which is commonly used and often required for current distributed file systems.  
  
* Hadoop is designed for Map/Reduce jobs, which makes it difficult to extend '''[[HDFS?]]''' as a normal file system.
+
* Hadoop is designed for Map/Reduce jobs, which makes it difficult to extend HDFS as a normal file system.
  
 
* Using Hadoop is time-consuming for small files.
 
* Using Hadoop is time-consuming for small files.
Line 25: Line 27:
 
== Comparing the Performance of Lustre and HDFS ==
 
== Comparing the Performance of Lustre and HDFS ==
  
The paper [http://wiki.lustre.org/images/1/1b/Hadoop_wp_v0.4.2.pdf'' Using Lustre with Apache Hadoop''] provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests.
+
The paper [[Media:Hadoop_wp_v0.4.2.pdf|''Using Lustre with Apache Hadoop'']] provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests.

Latest revision as of 12:00, 22 February 2010

(Updated: Jan 2010)

This project was conducted by three interns as a proof of concept study for how to integrate Apache Hadoop with Lustre™. Although the performance tests, conducted by the interns to compare Lustre to HDFS when used with Hadoop, did not show a significant performance benefit for Lustre, this study has laid the groundwork for further work in this area.

Benefits of Using Hadoop with Lustre

Using Hadoop with Lustre provides several benefits, including:

  • Lustre is a real parallel file system, which enables temporary or intermediate data to be stored in parallel on multiple nodes reducing the load on single nodes.
  • Lustre has its own network protocol, which is more efficient for bulk data transfer than the HTTP protocol. Additionally, because Lustre is a shared file system, each client sees the same file system image, so hardlinks can be used to avoid data transfer between nodes.
  • Lustre can be mounted as a standard POSIX file system.

Drawbacks of Using Hadoop with HDFS

Using Hadoop with HDFS has several drawbacks, including:

  • Hadoop sometimes generates a large amount of temporary or intermediate data during the Map/Reduce process. HDFS stores these files on the local disk, which results in a considerable load on OS and disk I/O.
  • During the Map/Reduce process, the Reduce node uses the HTTP protocol to retrieve Map results from the Map node protocol. The HTTP protocol is not a good choice for large data transfers because it does not support the RDMA protocol, which is commonly used and often required for current distributed file systems.
  • Hadoop is designed for Map/Reduce jobs, which makes it difficult to extend HDFS as a normal file system.
  • Using Hadoop is time-consuming for small files.

Comparing the Performance of Lustre and HDFS

The paper Using Lustre with Apache Hadoop provides an overview of Hadoop and HDFS and describes how to set up Lustre with Hadoop. It also provides performance results for several categories of test cases including application tests (word counts, reading and outputting large non-splittable files), sub-phase tests to identify the time-consuming phases for specific applications, and benchmark I/O tests.