|RapidMiner, Process, Optimization||13 Oct 2010|
|Finding Optimal Operators and Subprocesses - Without PaREn (aka "The Naive Way") by Ingo Mierswa||
We recently had a discussion in the forum about the new PaREn extension for RapidMiner, in particular about whether the functionality behind the PaREn extension is also part of other data mining solutions.
The PaREn Automatic System Construction Wizard is a tool for supporting you in constructing a classification process within RapidMiner. For a given data set, it automatically recommends, constructs, and optimizes a classification process based on basic characteristics of the data set. You select a data set and the PaREn extension analyzes the data and predicts the expected accuracy for a set of widely used data mining algorithms.
One of the readers in the forum compared this to the SPSS function where a set of different models is tested on a data set and the best model is automatically chosen. But there is an important difference: in SPSS, all models are actually tested! This can, by the way, also be done with the PaREn extension during the evaluation step, and it is also possible with a simple process, as Simon has pointed out in the discussion. And exactly this process will be shown below ;-)
The cool thing about the PaREn extension is that it predicts which model is probably the best even without any testing. This is the first time I have actually seen this meta learning approach really working and this is probably the reason why we at Rapid-I and many others love it. Kudos to Christian and the team of the DFKI for this great extension!
OK, back to the promise that the simple approach taken by SPSS is of course also possible with RapidMiner. The following process employs this more manual approach. Combining the operator "Operator Enabler" with a grid parameter search makes it easy to enable or disable single operators or complete subprocesses. For example, you could use this combination to try different model types on a given data set. In the process discussed here, we use it to measure the difference between applying normalization and skipping it before a nearest neighbors classifier.
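To make the idea concrete outside of RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner process itself and does not use any RapidMiner API: the data set is synthetic, and all names (`make_data`, `fit_zscore`, `cv_accuracy`, `grid_search`) are made up for illustration. The boolean `normalize` flag plays the role of the "Operator Enabler", and the grid simply crosses that flag with a few values of k.

```python
import math
import random


def make_data(n=120, seed=0):
    """Synthetic data (an assumption for this sketch): feature 0 separates the
    two classes on a small scale, feature 1 is pure noise on a huge scale."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        x0 = rng.gauss(2.0 * label, 0.5)   # informative feature
        x1 = rng.gauss(0.0, 1000.0)        # uninformative, large-scale feature
        rows.append(((x0, x1), label))
    return rows


def fit_zscore(train):
    """Learn per-feature mean and standard deviation on the training rows only."""
    cols = list(zip(*(x for x, _ in train)))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return lambda x: tuple((v - m) / s for v, m, s in zip(x, means, stds))


def knn_predict(train, x, k):
    """Majority vote among the k nearest training rows (binary labels 0/1)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(data, k, normalize, folds=5):
    """Cross-validated accuracy; `normalize` switches the preprocessing step on or off."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        if normalize:  # the "enabled" branch of the optional step
            transform = fit_zscore(train)
            train = [(transform(x), y) for x, y in train]
            test = [(transform(x), y) for x, y in test]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


def grid_search(data, ks=(1, 3, 5), flags=(False, True)):
    """Try every (k, normalize) combination and return the best one with all scores."""
    results = {(k, flag): cv_accuracy(data, k, flag) for k in ks for flag in flags}
    best = max(results, key=results.get)
    return best, results


if __name__ == "__main__":
    best, results = grid_search(make_data())
    for (k, flag), acc in sorted(results.items()):
        print(f"k={k} normalize={flag}: accuracy={acc:.3f}")
    print("best combination:", best)
```

Because the on/off switch is just another grid parameter, the same loop would also work for toggling any other optional step, as long as the evaluation around it stays the same.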
If you instead switch between different learning schemes, say Naive Bayes, Decision Trees, or Linear Regression, you end up with exactly the same "we just try different modeling techniques" approach as the one known from SPSS. Of course the PaREn extension is much cooler, but this manual approach offers a great advantage over the extension and other solutions: you can specify every step yourself, such as the evaluation scheme, as usual.
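The "just try different modeling techniques" variant can be sketched the same way. Again, this is only an illustration in plain Python under stated assumptions, not RapidMiner code: the three learners (a majority baseline, a nearest-centroid classifier, and 1-nearest-neighbor) are simple stand-ins for the learning schemes named above, the data is synthetic, and every function name is hypothetical. The point is that the evaluation scheme (`cross_validate`) is written once and is fully under your control, while only the learner is swapped.

```python
import math
import random
from collections import Counter


def fit_majority(train):
    """Baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_centroid(train):
    """Predict the class whose mean feature vector is closest."""
    groups = {}
    for x, y in train:
        groups.setdefault(y, []).append(x)
    centroids = {y: tuple(sum(c) / len(c) for c in zip(*xs))
                 for y, xs in groups.items()}
    return lambda x: min(centroids, key=lambda y: math.dist(centroids[y], x))


def fit_one_nn(train):
    """Predict the label of the single nearest training row."""
    return lambda x: min(train, key=lambda row: math.dist(row[0], x))[1]


# Candidate learners; enabling "one of them at a time" is the selection step.
LEARNERS = {
    "majority": fit_majority,
    "nearest_centroid": fit_nearest_centroid,
    "one_nn": fit_one_nn,
}


def cross_validate(fit, data, folds=5):
    """The evaluation scheme, shared by all candidates."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        predict = fit(train)
        correct += sum(predict(x) == y for x, y in test)
    return correct / len(data)


def pick_best(data, learners=LEARNERS):
    """Evaluate every learner with the same scheme and return the winner plus all scores."""
    scores = {name: cross_validate(fit, data) for name, fit in learners.items()}
    return max(scores, key=scores.get), scores


def make_blobs(n=100, seed=1):
    """Synthetic two-class data (an assumption for this sketch)."""
    rng = random.Random(seed)
    return [((rng.gauss(3.0 * (i % 2), 1.0), rng.gauss(3.0 * (i % 2), 1.0)), i % 2)
            for i in range(n)]


if __name__ == "__main__":
    best, scores = pick_best(make_blobs())
    for name, acc in scores.items():
        print(f"{name}: accuracy={acc:.3f}")
    print("best learner:", best)
```

Note that every candidate is actually trained and tested here, which is exactly what distinguishes this naive approach from PaREn's prediction of the expected accuracy.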
The complete process can be downloaded with our Community Extension. The name of the process is "Automatical Disabling / Enabling of Operators or Subprocesses".