Meanwhile, a lot of Mahout tutorials on the web attempt to explain at least two of the following three very deep topics:
- Machine Learning
- MapReduce
- Statistics
For example, if you want to create a distributed recommender, you might follow this (otherwise excellent) tutorial: http://kickstarthadoop.blogspot.com/2011/05/generating-recommendations-with-mahout_26.html. However, even though the blog post is titled "kickstart hadoop", it actually explains the NON-Hadoop implementation. D'oh!
So, in general, if you're new to Mahout and you're planning on processing terabytes (or more) of data, you will want to be careful to use the Hadoop-specific, scalable Mahout APIs when writing your Mahout jobs, and not just the in-memory ones, which are TOTALLY different.
So, Captain Obvious here will point out some bullets for you (with a code sketch of the wrong, in-memory style right after the list). If a Mahout task is going to scale, then it:
- Obviously will not point to any file:// URLs
- Won't use much memory, and will be broken into several Mapper/Reducer stages
- Will have a reference to "hadoop" somewhere in its package names
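To make the contrast concrete, here's a minimal sketch of the in-memory code path, i.e. the kind of code that trips all three alarms above. The class names come from Mahout's "taste" API; ratings.csv (userID,itemID,rating triples) is a hypothetical local file:

```java
// In-memory ("taste") code path: note the org.apache.mahout.cf.taste.impl
// packages and the local file -- this runs in a single JVM and will not scale.
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class InMemoryRecommender {
  public static void main(String[] args) throws Exception {
    // file:// input, one process, everything in RAM: all three red flags.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
    NearestNUserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 5 items for user 1, computed entirely in memory.
    for (RecommendedItem item : recommender.recommend(1L, 5)) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}
```

The scalable counterpart lives in the org.apache.mahout.cf.taste.hadoop packages; see the Recommendations section below for what driving it looks like.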
Topic Extraction
Here's one of the better resources I've found for understanding how to do distributed topic extraction: http://odbms.org/download/TamingTextCH06.pdf. This is an excerpt from the Taming Text book. In particular, it describes the fact that you will need to create vectors first. You might want to look at http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program while you're at it. I still haven't found a good library for creating vectors in SequenceFiles in parallel, i.e. as part of a MapReduce pipeline.
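For what it's worth, the vector-writing step itself is small; it's the parallelization that's missing. Here's a hedged sketch of writing (docId, vector) pairs in the SequenceFile<Text, VectorWritable> format that Mahout's clustering and LDA jobs consume. The output path and vector contents are made up; in real life the values would come from your CSV parser or tokenizer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output location; Mahout's clustering/LDA jobs read
    // directories of SequenceFile<Text, VectorWritable> like this one.
    Path out = new Path("vectors/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One sparse vector per document; the indices and values here are
      // placeholders for whatever your parser actually produces.
      Vector doc = new RandomAccessSparseVector(10000);
      doc.set(7, 3.0);
      doc.set(42, 1.0);
      writer.append(new Text("doc-1"), new VectorWritable(doc));
    } finally {
      writer.close();
    }
  }
}
```

Doing this as the output side of a Mapper or Reducer (rather than a single local loop, as above) is exactly the part nobody seems to have packaged up.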
Classification
Classification is hot and algorithmically complex, so it's easy to get lost in articles that describe ROC curves, correlation coefficients, and so on. The reality is that most of us already know a little about this kind of thing, and in any case, learning about Mahout and ROC curves at the same time (for those with experience in neither) is likely to cause an aneurysm. If you want to get straight to the distributed part of things, check out:
http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/ which links back to a first primer article about how to create the training set (non-distributed). Combining the content from those two articles yields a pretty simple classification workflow whose "big" portion is implemented in MapReduce. Creating the model doesn't really need MapReduce quite as much.
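If you follow those two articles, the distributed training step boils down to launching Mahout's TrainNaiveBayesJob on your vectorized training set. A sketch of driving it programmatically, with hypothetical paths; the long flag names mirror the trainnb CLI from the Mahout 0.8 era, so double-check them against --help on your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

public class TrainTweetClassifier {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths: the input must already be tf-idf vectors in a
    // SequenceFile (e.g. produced by seq2sparse), keyed by /label/docid.
    ToolRunner.run(new Configuration(), new TrainNaiveBayesJob(), new String[] {
        "--input", "tweets-vectors/tfidf-vectors",
        "--output", "model",
        "--labelIndex", "labelindex",
        "--extractLabels",
        "--overwrite"
    });
  }
}
```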
Note that Mahout's naive Bayes implementation DOESN'T exist for single machines (it's Hadoop-only), whereas its logistic regression (SGD-based) is optimized for single-machine classification.
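So if your data fits on one box, the SGD-based OnlineLogisticRegression is the tool Mahout hands you. A tiny sketch, with entirely made-up numbers, just to show the single-JVM, one-example-at-a-time training style:

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features, L1 prior; the learning rate is arbitrary here.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(2, 3, new L1()).learningRate(0.1);
    // Online SGD: one (label, feature-vector) example at a time, single JVM.
    Vector example = new DenseVector(new double[] {1.0, 0.5, -0.2});
    learner.train(1, example);
    // For a two-category model, classifyScalar returns P(category == 1).
    System.out.println(learner.classifyScalar(example));
  }
}
```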
Recommendations
There are two ways to do recommendation in Mahout: a distributed and a non-distributed code path. The non-distributed code path is somehow covered ALL OVER THE PLACE, but alas, if we really needed Mahout, we probably wouldn't need a non-parallel implementation. Thankfully this post: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ covers some of the finer details of distributed versus non-distributed recommenders with Hadoop. In particular, it's the https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html implementation that you want to use.
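Driving RecommenderJob looks roughly like this. The HDFS paths are hypothetical and the input is the usual userID,itemID,preference triples; note that, per the bullets at the top, this is a real Hadoop Tool rather than an in-memory object:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

public class DistributedRecommender {
  public static void main(String[] args) throws Exception {
    // Hypothetical HDFS paths; input is userID,itemID,preference CSV triples.
    // This kicks off a chain of MapReduce jobs -- the scalable code path.
    ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
        "--input", "hdfs:///ratings",
        "--output", "hdfs:///recommendations",
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--numRecommendations", "10"
    });
  }
}
```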
Great posts -- I for one would love to see more posts that point out the sharp edges to avoid, as you said.
Thanks Matt. If you want to help clean this up, I'd love some help on https://issues.apache.org/jira/browse/MAHOUT-1421. I don't have time to get to it right now, but it would be awesome to get good docs on it.
This is good stuff! My reference for testing out Mahout :D
Anush