jayunit100: Running CDH5 on GlusterFS

I recently spent some time getting Cloudera’s CDH 5 distribution of Apache
Hadoop to work on GlusterFS using distributed-replicated volumes (replica
count 2). This is possible because Apache Hadoop has a pluggable filesystem
architecture, which allows the computational components within the CDH 5
distribution to be configured to use filesystems other than HDFS. In this
case, CDH 5 is configured to use the Hadoop FileSystem plugin for GlusterFS
(glusterfs-hadoop), which allows it to run on Gluster. The diagram below
illustrates the CDH 5 core processes and how they interact with GlusterFS.
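To make the pluggable-filesystem idea concrete, here is a minimal sketch in Python that generates the kind of core-site.xml fragment that points Hadoop at a non-HDFS filesystem. Hadoop looks up the implementation class for a URI scheme through a property named fs.<scheme>.impl and takes its default filesystem from fs.defaultFS. The specific values below (the glusterfs:/// URI and the GlusterFileSystem class name) are my assumptions about what the glusterfs-hadoop plugin expects, and the helper build_core_site is purely illustrative, so check the project wiki for the authoritative property list before using any of this.

```python
import xml.etree.ElementTree as ET

# Assumed property names and values for the glusterfs-hadoop plugin.
# Verify them against the glusterfs-hadoop project wiki; a real deployment
# will likely need additional volume and mount-point settings.
GLUSTER_PROPERTIES = {
    "fs.defaultFS": "glusterfs:///",
    "fs.glusterfs.impl": "org.apache.hadoop.fs.glusterfs.GlusterFileSystem",
}


def build_core_site(properties):
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


core_site_xml = build_core_site(GLUSTER_PROPERTIES)
print(core_site_xml)
```

The point of the sketch is only that the switch away from HDFS happens in configuration: the components keep calling the same Hadoop FileSystem interface, and the scheme-to-class mapping decides which backend serves those calls.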

Running a Single CDH 5 Deployment on One or More GlusterFS Volumes

Because the CDH 5 distribution includes other components besides YARN and
MapReduce, I used the Apache Bigtop system testing framework to explicitly
validate that Apache Sqoop, Apache Flume, Apache Pig, Apache Hive, Apache
Oozie, Apache Mahout, Apache ZooKeeper, Apache Solr and Apache HBase also
ran successfully.

Work is still in progress to enable the use of Impala. If you would like to
help accelerate that work, please reach out to us on the Gluster mailing list.

Implementation details for this solution and the specific setup required for
all the components are available on the glusterfs-hadoop project wiki. If you
have additional questions, feel free to reach out to me on FreeNode (IRC
handle jayunit100), @jayunit100 on Twitter, or via the Gluster mailing list.

15.8.14
