Running Hadoop with Lustre: Difference between revisions

Revision as of 16:36, 4 August 2009

This page describes how Hadoop performs with the Lustre file system after the Hadoop Distributed File System (HDFS) is replaced with Lustre.

Hadoop can generate a large amount of temporary/intermediate data during the Map/Reduce process. With HDFS, these files are stored on the local disk, which puts a considerable load on the OS and disk.

During the Map/Reduce process, the Reduce node uses the HTTP protocol to retrieve Map results from the Map node. HTTP is not a good choice for large data transfers.

Hadoop is designed for Map/Reduce jobs, which makes it hard to extend as a normal file system.

Using Hadoop with Lustre offers several advantages over HDFS. We have made several enhancements to improve the use of Hadoop with Lustre.

Lustre is a real parallel file system, which enables temporary/intermediate data to be stored in parallel across multiple nodes, alleviating the load on any single node.

Lustre has its own network protocol, which is better suited to bulk data transfer than HTTP. Additionally, because Lustre is a real shared file system, each client sees the same file system image, so hard links can be used to avoid data transfer between nodes.
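The hard-link idea can be illustrated with a minimal sketch. It assumes that the Map output and the Reduce input directory live on the same shared mount (for example, a Lustre client mount visible to every node). The function name and the paths are hypothetical and chosen for illustration only; this is not the code used by the Hadoop enhancements described on this page.

```python
import os


def link_map_output(map_output: str, reduce_input: str) -> None:
    """Expose a Map output file to a reducer through a hard link.

    Assumption: both paths are on the same shared file system, so the
    link only adds a directory entry and no file data is copied or sent
    over the network by this call.
    """
    parent = os.path.dirname(reduce_input)
    if parent:
        # Create the (hypothetical) per-reducer input directory if needed.
        os.makedirs(parent, exist_ok=True)
    # os.link creates a second name for the same underlying file.
    os.link(map_output, reduce_input)
```

After the call, both names refer to the same file, so a reducer on another node could open its input path directly instead of fetching the data from the Map node over HTTP.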

This paper provides suggestions about how to set up Lustre with Hadoop and how to use stripe information to help Hadoop schedule the job.

@@ Line 11: / Line 11: @@
 * Hadoop is time-consuming for small files.
-== Advantages of Using Lustre with Hadoop ==
+== Advantages of Using Hadoop with Lustre ==
-Using Lustre with Hadoop offers several advantages over HDFS. We have made several enhancements to improve the use of Lustre with Hadoop.
+Using Hadoop with Lustre offers several advantages over HDFS. We have made several enhancements to improve the use of Hadoop with Lustre.
 * Lustre is a real parallel file system, which enables temporary/intermediate data to be stored in parallel across multiple nodes, alleviating the load on any single node.