Meanwhile, a lot of Mahout tutorials on the web attempt to explain at least two of the following three very deep topics:
- Machine Learning
- MapReduce
- Statistics
For example, if you want to create a distributed recommender, you might follow this (otherwise excellent) tutorial: http://kickstarthadoop.blogspot.com/2011/05/generating-recommendations-with-mahout_26.html. However, even though the blog post is titled "kickstart hadoop", it actually explains the NON-Hadoop implementation. D'oh!
So, in general, if you're new to Mahout and you're planning on processing terabytes (or more) of data, you will want to be careful to use the Hadoop-specific, scalable Mahout APIs when writing your Mahout jobs, and not just the in-memory ones, which are TOTALLY different.
So, Captain Obvious here will point out some bullets for you (with a code sketch of the wrong, in-memory style right after the list). If a Mahout task is going to scale, then it:
- Obviously will not point to any file:// URLs
- Won't use much memory, and will be broken into several Mapper/Reducer stages
- Will have a reference to "hadoop" somewhere in its package names
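To make the contrast concrete, here's a minimal sketch of the in-memory code path, i.e. the kind of code that trips all three alarms above. The class names come from Mahout's "taste" API; ratings.csv (userID,itemID,rating triples) is a hypothetical local file:

```java
// In-memory ("taste") code path: note the org.apache.mahout.cf.taste.impl
// packages and the local file -- this runs in a single JVM and will not scale.
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class InMemoryRecommender {
  public static void main(String[] args) throws Exception {
    // file:// input, one process, everything in RAM: all three red flags.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
    NearestNUserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 5 items for user 1, computed entirely in memory.
    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}
```

The scalable counterpart lives in the org.apache.mahout.cf.taste.hadoop packages; see the Recommendations section below for what driving it looks like.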
Topic Extraction
Here's one of the better resources I've found for understanding how to do distributed topic extraction: http://odbms.org/download/TamingTextCH06.pdf. This is an excerpt from the Taming Text book. In particular, it describes the fact that you will need to create vectors first. You might want to look at http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program while you're at it. I still haven't found a good library for creating vectors in SequenceFiles in parallel, i.e. as part of a MapReduce pipeline.
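For what it's worth, the vector-writing step itself is small; it's the parallelization that's missing. Here's a hedged sketch of writing (docId, vector) pairs in the SequenceFile<Text, VectorWritable> format that Mahout's clustering and LDA jobs consume. The output path and vector contents are made up; in real life the values would come from your CSV parser or tokenizer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output location; Mahout's clustering/LDA jobs read
    // directories of SequenceFile<Text, VectorWritable> like this one.
    Path out = new Path("vectors/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One sparse vector per document; the indices and values here are
      // placeholders for whatever your parser actually produces.
      Vector doc = new RandomAccessSparseVector(10000);
      doc.set(7, 3.0);
      doc.set(42, 1.0);
      writer.append(new Text("doc-1"), new VectorWritable(doc));
    } finally {
      writer.close();
    }
  }
}
```

Doing this as the output side of a Mapper or Reducer (rather than a single local loop, as above) is exactly the part nobody seems to have packaged up.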
Classification
Classification is hot and algorithmically complex, so it's easy to get lost in articles that describe ROC curves, correlation coefficients, and so on. The reality is that most of us already know a little about this kind of thing, and in any case, learning about Mahout and ROC curves at the same time (for those with experience in neither) is likely to cause an aneurysm. If you want to get straight to the distributed part of things, check out:
http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/ which links back to a first primer article about how to create the training set (non-distributed). Combining the content from those two articles yields a pretty simple classification workflow whose "big" portion is implemented in MapReduce. Creating the model doesn't really need MapReduce quite as much.
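If you follow those two articles, the distributed training step boils down to launching Mahout's TrainNaiveBayesJob on your vectorized training set. A sketch of driving it programmatically, with hypothetical paths; the long flag names mirror the trainnb CLI from the Mahout 0.8 era, so double-check them against --help on your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

public class TrainTweetClassifier {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths: the input must already be tf-idf vectors in a
    // SequenceFile (e.g. produced by seq2sparse), keyed by /label/docid.
    ToolRunner.run(new Configuration(), new TrainNaiveBayesJob(), new String[] {
        "--input", "tweets-vectors/tfidf-vectors",
        "--output", "model",
        "--labelIndex", "labelindex",
        "--extractLabels",
        "--overwrite"
    });
  }
}
```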
Note that Mahout's naive Bayes implementation DOESN'T exist for single machines (it's Hadoop-only), whereas its logistic regression (SGD-based) is optimized for single-machine classification.
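So if your data fits on one box, the SGD-based OnlineLogisticRegression is the tool Mahout hands you. A tiny sketch, with entirely made-up numbers, just to show the single-JVM, one-example-at-a-time training style:

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features, L1 prior; the learning rate is arbitrary here.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1);
    // Online SGD: one (label, feature-vector) example at a time, single JVM.
    Vector example = new DenseVector(new double[] {1.0, 0.5, -0.2});
    learner.train(1, example);
    // For a two-category model, classifyScalar returns P(category == 1).
    System.out.println(learner.classifyScalar(example));
  }
}
```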
Recommendations
There are two ways to do recommendation in Mahout: a distributed and a non-distributed code path. The non-distributed code path is somehow covered ALL OVER THE PLACE, but alas, if we really needed Mahout, we probably wouldn't need a non-parallel implementation. Thankfully this post: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ covers some of the finer details of distributed versus non-distributed recommenders with Hadoop. In particular, it's the https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html implementation that you want to use.
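Driving RecommenderJob looks roughly like this. The HDFS paths are hypothetical and the input is the usual userID,itemID,preference triples; note that, per the bullets at the top, this is a real Hadoop Tool rather than an in-memory object:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class DistributedRecommender {
  public static void main(String[] args) throws Exception {
    // Hypothetical HDFS paths; input is userID,itemID,preference CSV triples.
    // This kicks off a chain of MapReduce jobs -- the scalable code path.
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "hdfs:///ratings",
        "--output", "hdfs:///recommendations",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--numRecommendations", "10"
    });
  }
}
```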
Great posts -- I for one would love to see more posts that point out the sharp edges to avoid, as you said.
Thanks Matt. If you want to help clean this up, I'd love some help on https://issues.apache.org/jira/browse/MAHOUT-1421. I don't have time to get to it right now, but it would be awesome to get good docs on it.
This is good stuff! My reference for testing out Mahout :D
Anush