Nice.
I'm working with Hadoop right now.
Would be nice to see it integrated with RapidMiner.
But doesn't Hadoop lack a huge number of algorithms?
https://cwiki.apache.org/confluence/display/MAHOUT/AlgorithmsClassification
A general introduction to the most common text classification algorithms can be found at Google Answers:
http://answers.google.com/answers/main?cmd=threadview&id=225316
For information on the algorithms implemented in Mahout (or scheduled for implementation), please visit the following pages.
Logistic Regression (SGD)
Bayesian
Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)
Perceptron and Winnow (open: MAHOUT-85)
Neural Network (open, but MAHOUT-228 might help)
Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)
Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)
Online Passive Aggressive (awaiting patch commit, MAHOUT-702)