jayunit100: Gluster and hadoop: Using FUSE options to defy the CAP theorem.

The bigtop smoke tests are a great way to hit your cluster with a broad range of hadoop workloads.

One thing I learned from running those diverse hadoop workloads on top of gluster is that gluster is really modular. Something really cool about it is that you can change its consistency model, and thus its performance, really easily, without even touching the translator stack.

One hidden gem I uncovered recently is that the LDA mahout test creates 1000s of small files, which effectively forces you to run a small-file workload performance benchmark.

Well, we found that in glusterfs-hadoop:

If you go for strict consistency by setting the timeout attrs (entry-timeout and attribute-timeout, i.e. how long FUSE caches name lookups and file attributes) to 0 at mount time... you get burnt on high-throughput / small-file puts.
BUT if you go for eventual consistency: you run into task failures on the renames that occur in mapreduce workloads.

So... what do we do now? Well... there's another option: you can have both! Just create TWO gluster mounts :) :) :) ... one for mapreduce, and one for small-file workloads. In the end, the data all goes to the same FS, but you get much faster import of data if you don't force strict consistency.

BUT WAIT. Before you go trying this out, just a quick sanity check. You're not required to use the hadoop API for putting files into your gluster file system. So, you could just skip all of this and write directly to your FUSE mount. You can easily just use posix commands to copy directly into your mount.
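Here's a minimal sketch of that direct-copy route, using nothing but POSIX tools. The paths are assumptions: `DEST` stands in for a directory on your gluster FUSE mount (something under /mnt/glusterfsFAST, say), and it defaults to a scratch directory so the snippet runs harmlessly on a box without gluster.

```shell
#!/bin/sh
# Sketch only: SRC and DEST are hypothetical paths. On a real cluster, point
# DEST at a directory on the gluster FUSE mount.
SRC="${SRC:-$(mktemp -d)}"
DEST="${DEST:-$(mktemp -d)}"

# Create a few small sample files if SRC is empty, so there is something to copy.
if [ -z "$(ls -A "$SRC")" ]; then
    for i in 1 2 3 4 5; do
        echo "record $i" > "$SRC/part-$i.txt"
    done
fi

# Plain POSIX copy: no hadoop client and no XML config involved.
# "$SRC"/. copies the directory's contents (dotfiles included) into DEST.
cp -R "$SRC"/. "$DEST"/

# Sanity check: count regular files on both sides.
src_count=$(find "$SRC" -type f | wc -l | tr -d ' ')
dest_count=$(find "$DEST" -type f | wc -l | tr -d ' ')
echo "copied $dest_count of $src_count files into $DEST"
```

Run it with `SRC=/tmp/mylocaldir DEST=/your/gluster/mount/dir` to copy real data; the file counts are only a cheap check that nothing was dropped.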

Okay, so this is great. But I'm only doing small-file puts once; do I have to reconfigure my whole hadoop cluster for that?

NO YOU DON'T !!! I hate maintaining XML files. I certainly don't want to maintain 2 XML files! ... Hadoop runtime options to the rescue.

### for quick ETL via the hadoop API ###

mount -o entry-timeout=2,attribute-timeout=2 /mnt/glusterfsFAST
hadoop fs -Dfs.glusterfs.mount=/mnt/glusterfsFAST -put /tmp/mylocaldir/* /dfs

This will write to the EXACT SAME gluster volume, except that it will use an alternative MOUNT point.

### Meanwhile, in the core xml configs, we keep strict consistency mount ###

mount -o entry-timeout=0,attribute-timeout=0 /mnt/glusterfsCONSISTENT

<property>
<name>fs.glusterfs.mount</name>
<value>/mnt/glusterfsCONSISTENT</value>
</property>

But we can still pass this same parameter to the hadoop file system command at runtime, so that it writes through a different mount.

Again: I'll just paste the overriding mount option example, so it's clear that you don't have to restart your cluster or modify an XML file.

mount -o entry-timeout=2,attribute-timeout=2 /mnt/glusterfsFAST

hadoop fs -Dfs.glusterfs.mount=/mnt/glusterfsFAST -put /tmp/mylocaldir/* /dfs

I'll say it one more time, for people confused by all the rainbow coloring.

Well, we have one file system mount which is strictly consistent and is used in the default hadoop configuration (the one typical mapreduce jobs launch with). But for the specific case where we don't mind looser consistency, say putting several thousand files, we create a FAST mount with timeout options a little looser. Initially, this seems to result in 8-10x faster puts of files into gluster.
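If you want to measure the difference on your own cluster, here is a rough timing harness. Treat it as a sketch under assumptions: `N`, `FAST_DIR` and `STRICT_DIR` are made-up names, the two directories default to scratch dirs so the script runs anywhere, and it times plain `cp` rather than `hadoop fs -put`, so the numbers it gives are not comparable to the figure above.

```shell
#!/bin/sh
# Sketch only: N and both target directories are assumptions. On a real
# cluster, set FAST_DIR and STRICT_DIR to directories on the two gluster mounts.
N="${N:-1000}"
SRC=$(mktemp -d)
FAST_DIR="${FAST_DIR:-$(mktemp -d)}"
STRICT_DIR="${STRICT_DIR:-$(mktemp -d)}"

# Generate N small files (one short line each) as a stand-in workload.
i=1
while [ "$i" -le "$N" ]; do
    echo "doc $i" > "$SRC/doc-$i.txt"
    i=$((i + 1))
done

# time_copy DEST: copy everything in SRC into DEST and print elapsed seconds.
# date +%s only has one-second resolution, which is coarse but widely available.
time_copy() {
    start=$(date +%s)
    cp -R "$SRC"/. "$1"/
    end=$(date +%s)
    echo $((end - start))
}

fast_secs=$(time_copy "$FAST_DIR")
strict_secs=$(time_copy "$STRICT_DIR")
echo "fast mount:   ${fast_secs}s for $N files"
echo "strict mount: ${strict_secs}s for $N files"
```

With the default scratch directories both timings will be about the same; the interesting run is the one where each directory lives on a differently mounted view of the same volume.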

It's nice that gluster plays so nicely with FUSE; it allows you to do a lot of interesting things, both in the hadoop interop and, in general, for supporting multiple real-world workflows at once in a single distributed file system.

It's also nice that hadoop lets you send it arbitrary configuration changes at runtime. This means that for differing hadoop workloads, we can use differing FUSE mounts to hack around different performance bottlenecks.
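One way to make the per-workload choice less error-prone is a tiny wrapper. This is only a sketch: the mount paths and the `fs.glusterfs.mount` property come from this post, but the function name, the workload labels and the `DRY_RUN` switch are invented for illustration. It defaults to dry-run mode, so it just prints the command it would run.

```shell
#!/bin/sh
# Sketch only. Function name, workload labels and DRY_RUN are hypothetical.
FAST_MOUNT="${FAST_MOUNT:-/mnt/glusterfsFAST}"
CONSISTENT_MOUNT="${CONSISTENT_MOUNT:-/mnt/glusterfsCONSISTENT}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

# gluster_hadoop_fs WORKLOAD ARGS...
#   WORKLOAD "smallfile" picks the looser-timeout mount;
#   anything else falls back to the strictly consistent mount.
gluster_hadoop_fs() {
    workload="$1"
    shift
    case "$workload" in
        smallfile) mount_point="$FAST_MOUNT" ;;
        *)         mount_point="$CONSISTENT_MOUNT" ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        echo hadoop fs "-Dfs.glusterfs.mount=$mount_point" "$@"
    else
        hadoop fs "-Dfs.glusterfs.mount=$mount_point" "$@"
    fi
}

# Example calls: a bulk import through the fast mount, a listing through the strict one.
gluster_hadoop_fs smallfile -put /tmp/mylocaldir/* /dfs
gluster_hadoop_fs mapreduce -ls /dfs
```

Defaulting to the consistent mount for unknown workloads is deliberate: a slow put is annoying, but a failed rename in a mapreduce job is worse.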

This is just the beginning of my experimentation with the mount options, but to see others, check here: http://gluster.org/community/documentation/index.php/Gluster_3.2:_Setting_Volume_Options (if there's an updated 3.3 doc, somebody let me know and I'll correct this link).

12.2.14
