Author Topic: PaREn Extension  (Read 8504 times)
dragonedison (Newbie, Posts: 17)
« on: September 18, 2010, 08:39:42 AM »

Dear everyone,

I see that the new RapidMiner update includes the PaREn Extension, which claims to be able to suggest the most suitable classification method for a dataset. I would very much like to know how to use this extension.

Regards,
Gary
dan_ (Full Member, Posts: 114)
« Reply #1 on: September 19, 2010, 04:43:14 PM »


Hi,

Try this

http://madm.dfki.de/rapidminer/wizard

However, some fixes may still be needed; I tried to follow the guidelines in a simple test and was not able to run it through to the end.

Regards
Dan
awchisholm (Sr. Member, Posts: 369)
« Reply #2 on: September 19, 2010, 08:38:16 PM »

Hello all,

I found that the LandMarking operator doesn't work out of the box, but after deselecting the "Linear Discriminant" check box I got a successful run.

Here's an example in which LandMarking predicts that the KNN operator will do best on the Sonar data set - and lo and behold, it seems to, so that's quite cool.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.10" expanded="true" name="Process">
    <process expanded="true" height="557" width="614">
      <operator activated="true" class="retrieve" compatibility="5.0.10" expanded="true" height="60" name="Sonar data set" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.0.10" expanded="true" height="130" name="Multiply" width="90" x="45" y="210"/>
      <operator activated="true" class="x_validation" compatibility="5.0.10" expanded="true" height="112" name="Decision Tree (2)" width="90" x="179" y="390">
        <description>A cross-validation evaluating a decision tree model.</description>
        <process expanded="true" height="549" width="310">
          <operator activated="true" class="decision_tree" compatibility="5.0.10" expanded="true" height="76" name="Decision Tree" width="90" x="112" y="30"/>
          <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="549" width="310">
          <operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model (3)" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance (Decision Tree)" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (Decision Tree)" to_port="labelled data"/>
          <connect from_op="Performance (Decision Tree)" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.10" expanded="true" height="112" name="Naive Bayes" width="90" x="179" y="255">
        <description>A cross-validation evaluating a kernel naive Bayes model.</description>
        <process expanded="true" height="396" width="301">
          <operator activated="true" class="naive_bayes_kernel" compatibility="5.0.10" expanded="true" height="76" name="Naive Bayes (Kernel)" width="90" x="110" y="30"/>
          <connect from_port="training" to_op="Naive Bayes (Kernel)" to_port="training set"/>
          <connect from_op="Naive Bayes (Kernel)" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="396" width="301">
          <operator activated="true" class="apply_model" compatibility="5.0.10" expanded="true" height="76" name="Apply Model (2)" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.10" expanded="true" height="76" name="Performance (Naive Bayes)" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (Naive Bayes)" to_port="labelled data"/>
          <connect from_op="Performance (Naive Bayes)" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="KNN" width="90" x="179" y="120">
        <description>A cross-validation evaluating a k-NN model.</description>
        <process expanded="true" height="654" width="466">
          <operator activated="true" class="k_nn" compatibility="5.0.10" expanded="true" height="76" name="k-NN" width="90" x="179" y="30"/>
          <connect from_port="training" to_op="k-NN" to_port="training set"/>
          <connect from_op="k-NN" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="654" width="466">
          <operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Performance (KNN)" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (KNN)" to_port="labelled data"/>
          <connect from_op="Performance (KNN)" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="paren:landmarking" compatibility="5.0.0" expanded="true" height="60" name="LandMarking" width="90" x="179" y="30">
        <parameter key="Linear Discriminant" value="false"/>
        <parameter key="Cross-validation" value="true"/>
        <parameter key="Normalize Dataset" value="false"/>
      </operator>
      <connect from_op="Sonar data set" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="LandMarking" to_port="exampleset"/>
      <connect from_op="Multiply" from_port="output 2" to_op="KNN" to_port="training"/>
      <connect from_op="Multiply" from_port="output 3" to_op="Naive Bayes" to_port="training"/>
      <connect from_op="Multiply" from_port="output 4" to_op="Decision Tree (2)" to_port="training"/>
      <connect from_op="Decision Tree (2)" from_port="averagable 1" to_port="result 4"/>
      <connect from_op="Naive Bayes" from_port="averagable 1" to_port="result 3"/>
      <connect from_op="KNN" from_port="averagable 1" to_port="result 2"/>
      <connect from_op="LandMarking" from_port="exampleset" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>



Andrew

dragonedison (Newbie, Posts: 17)
« Reply #3 on: September 20, 2010, 03:44:04 AM »

Dear Dan,

Thank you! The link is exactly what I need.

Regards,
Gary
Sebastian Land (Administrator, Hero Member, Posts: 2426)
« Reply #4 on: September 22, 2010, 08:20:21 AM »

Hi,
we are in contact with the guys from DFKI who contribute this extension. They found out that it runs fine under Linux but fails on Windows machines. We will publish a new version as soon as possible.

Greetings,
  Sebastian
Simon Fischer (Administrator, Sr. Member, Posts: 448)
« Reply #5 on: September 23, 2010, 01:37:47 PM »

Hi all,

the fix is on the update server.

Best,
Simon

Simon Fischer, Rapid-I
RapidMiner Development on Twitter: @simon_fis
NeuralMarket (Newbie, Posts: 13)
« Reply #6 on: September 23, 2010, 04:16:32 PM »

Ah, so I wasn't the only one crashing this plugin on a Windows machine. Thanks for the quick fix, guys.

Thanks,
Tom


Tom
Neural Market Trends
www.neuralmarkettrends.com
dan_ (Full Member, Posts: 114)
« Reply #7 on: September 24, 2010, 04:06:35 PM »

Hi,

Providing an extension such as PaREn is a great and very useful initiative. This kind of feature is included in other major DM software, so it was about time. Many thanks to the PaREn team!

I have tested this feature again now that it is operational on Windows machines, and would like to make some constructive comments that, together with those to follow from others, will hopefully be useful feedback to the developers for future improvements.

Using a dataset of 1,000 rows with a binominal label, the accuracy of a PaREn-optimised classifier based on decision trees was 0.692, actually below the 0.726 accuracy of the elementary zeroR model (which simply predicts the mode, i.e. the most frequent class, in every case). Separately, I quickly built a decision tree, evaluated via cross-validation, which gave an accuracy of 0.737 - only a very small improvement.

I am not sure whether the ordering of these figures is statistically significant, but one would normally expect the PaREn-optimised classifier to outperform both the subsequent decision tree and the trivial model that blindly predicts the most frequent class.
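For readers unfamiliar with the zeroR baseline mentioned above, here is a minimal sketch of the comparison - outside RapidMiner, in Python with scikit-learn, on synthetic data chosen purely for illustration (not Dan's actual dataset):

Code:
# Minimal sketch: zeroR (majority-class) baseline vs. a cross-validated decision tree.
# Synthetic, illustrative data only.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.7, 0.3],
                           random_state=42)

# zeroR always predicts the most frequent class, so its accuracy is simply
# the relative frequency of that class.
zero_r = DummyClassifier(strategy="most_frequent")
tree = DecisionTreeClassifier(random_state=42)

print("zeroR accuracy:         %.3f" % cross_val_score(zero_r, X, y, cv=10).mean())
print("decision tree accuracy: %.3f" % cross_val_score(tree, X, y, cv=10).mean())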

Does anyone else have comments on their results?

BTW, most probably the answer is yes - but could the PaREn team tell us whether they made use of the ROC analysis implemented in RM, among others, to optimise accuracy? Thanks.

Regards,
Dan
« Last Edit: October 01, 2010, 09:54:11 AM by dan_agape »
faisalshafait (Newbie, Posts: 2)
« Reply #8 on: September 26, 2010, 07:54:38 AM »

Hi Dan,

Quote
Providing an extension such as PaREn is a great and very useful initiative. This kind of feature is included in other major DM software, so it was about time. Many thanks to the PaREn team!
Thanks for your encouraging remarks. Can you please point to some DM software that has similar functionality?

Quote
I am not sure whether the ordering of these figures is statistically significant, but one would normally expect the PaREn-optimised classifier to outperform both the subsequent decision tree and the trivial model that blindly predicts the most frequent class.
You are right. Generally, optimized classifiers should perform better than manually tuned ones. However, we are currently doing a coarse grid search over a few parameters while using default values for the others. In the case of decision trees, the search is limited to the 'confidence' parameter. Any suggestions about which parameters to optimize are welcome.
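Outside RapidMiner, a coarse grid search over a single pruning parameter looks roughly like the sketch below; scikit-learn's DecisionTreeClassifier and its ccp_alpha cost-complexity parameter are used here only as stand-ins, since RapidMiner's 'confidence' pruning parameter has no direct scikit-learn equivalent:

Code:
# Sketch of a coarse grid search over one decision-tree pruning parameter,
# scoring candidates by cross-validated classification accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Coarse grid over ccp_alpha; all other parameters stay at their defaults,
# analogous to searching only the 'confidence' parameter in RapidMiner.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": [0.0, 0.001, 0.01, 0.05, 0.1]},
    scoring="accuracy",
    cv=10,
)
grid.fit(X, y)
print("best ccp_alpha:", grid.best_params_["ccp_alpha"])
print("best cross-validated accuracy: %.3f" % grid.best_score_)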

Quote
could the PaREn team tell us whether they made use of the ROC analysis implemented in RM, among others, to optimise accuracy?
No, we are simply using classification accuracy for optimization purposes.

Cheers,
Faisal
NeuralMarket (Newbie, Posts: 13)
« Reply #9 on: October 01, 2010, 02:25:12 PM »

Faisal,

Thanks so much for providing this plugin! It really helps me in my data discovery tasks.

Regards,
Tom
www.neuralmarkettrends.com

Tom
Neural Market Trends
www.neuralmarkettrends.com
dan_ (Full Member, Posts: 114)
« Reply #10 on: October 06, 2010, 04:39:49 PM »


Hi Faisal,

Quote
Thanks for your encouraging remarks. Can you please point to some DM software that has similar functionality?

A similar (though not identical) feature - and a very effective one - is offered by IBM SPSS Modeler, for instance, as an automatic modelling operator, through which several models are produced automatically and the best of them are proposed to the user. Moreover, the models may be combined into a kind of voting model, which on some occasions may perform better than the individual models. See a demo here:

http://www.spss.com/media/demos/modeler/demo-modeler-overview/index.htm

Since you asked for suggestions: perhaps you could offer an option expressing how much the models are to be optimised, so that results can be produced in shorter or longer times depending on the choice. For practical reasons one could offer three levels, for instance low, medium and high optimisation (with the corresponding processing times increasing accordingly). This would offer a balance between processing time and model performance (one of my tests on a dataset of 1,000 rows took quite a long time to run, and sometimes we may want to reduce this time).

Also, you may wish to automatically select the best two or three models and offer their respective RM processes; alternatively, one could build a process in which these models are put to a vote, etc. Your add-on could potentially be a great help to data miners. Thanks again and good luck!
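To make the voting idea concrete, here is a small sketch - outside RapidMiner, in Python with scikit-learn on synthetic data - in which the three learner types from Andrew's process above (k-NN, naive Bayes, decision tree) are combined into a simple majority-vote ensemble and compared with the individual models:

Code:
# Sketch: combine candidate models into a hard-voting ensemble and compare it
# against the individual models via 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

candidates = [
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
    ("tree", DecisionTreeClassifier(random_state=1)),
]
vote = VotingClassifier(estimators=candidates, voting="hard")

for name, model in candidates + [("vote", vote)]:
    score = cross_val_score(model, X, y, cv=10).mean()
    print("%-5s accuracy: %.3f" % (name, score))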

Best,
Dan
Simon Fischer (Administrator, Sr. Member, Posts: 448)
« Reply #11 on: October 07, 2010, 07:48:28 AM »

Hi,

just as an aside: trying different models on a data set is easily possible using a combination of a parameter optimization operator and a subprocess selector. Maybe we should have a sample or building block for that :-)
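Conceptually - sketched here outside RapidMiner, in Python with scikit-learn on synthetic data - the "parameter optimization plus subprocess selector" pattern amounts to looping over candidate models, tuning each with a small grid, and keeping the best:

Code:
# Sketch: loop over candidate models (the analogue of selecting subprocesses),
# tune each with a small parameter grid, and keep the best performer.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

candidates = {
    "k-NN": (KNeighborsClassifier(), {"n_neighbors": [1, 5, 11]}),
    "Naive Bayes": (GaussianNB(), {}),  # no grid: evaluated with defaults
    "Decision Tree": (DecisionTreeClassifier(random_state=7),
                      {"min_samples_leaf": [1, 5, 20]}),
}

best_name, best_score = None, -1.0
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="accuracy", cv=10)
    search.fit(X, y)
    print("%-13s best accuracy: %.3f" % (name, search.best_score_))
    if search.best_score_ > best_score:
        best_name, best_score = name, search.best_score_

print("selected model:", best_name)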

Best,
Simon

Simon Fischer, Rapid-I
RapidMiner Development on Twitter: @simon_fis
Christian Kofler (Newbie, Posts: 4)
« Reply #12 on: October 07, 2010, 09:42:36 AM »

Hi,

thanks for the feedback.

Concerning the run-time of the evaluation (which includes optimization):
We are actually working on predicting the run-time as well. For each of the classifiers listed in the wizard you will then be able to see not only the predicted accuracy but also the expected run-time for training on the given data. This should help a lot when certain constraints have to be met, e.g. on embedded systems (where computational power is limited) or when you want to choose a classifier with reasonable performance but also low energy consumption. Maybe we should try to trademark "Green Data Mining" before releasing the next version of the PaREn Automatic System Construction Wizard ;-)
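As a toy illustration of the kind of report such a wizard could produce, the sketch below (outside RapidMiner, scikit-learn on synthetic data) records measured training time alongside cross-validated accuracy for each candidate; the wizard would of course predict these times rather than measure them:

Code:
# Sketch: record training time alongside accuracy for each candidate classifier,
# i.e. the measured counterpart of the run-time the wizard would try to predict.
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)

for name, model in [("k-NN", KNeighborsClassifier()),
                    ("Naive Bayes", GaussianNB()),
                    ("Decision Tree", DecisionTreeClassifier(random_state=3))]:
    acc = cross_val_score(model, X, y, cv=10).mean()
    start = time.perf_counter()
    model.fit(X, y)  # training time on the full data set
    train_time = time.perf_counter() - start
    print("%-13s accuracy: %.3f   training time: %.3f s" % (name, acc, train_time))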

Hm, this discussion doesn't have much to do with "Problems and Support" - and I am really happy about that!
Anyway, if you experience any issues, please let us know.


Cheers

Christian
tolau100 (Newbie, Posts: 4)
« Reply #13 on: October 07, 2010, 10:58:26 AM »

I just want to say THANKS to everyone involved in establishing this wonderful tool. For me, as a newbie in RapidMiner, the automatic 'pre-'prediction and processing saves plenty of time that I would otherwise have spent adjusting all the settings in the normal GUI.

Since I have no improvements to add: best wishes!
Sebastian Land (Administrator, Hero Member, Posts: 2426)
« Reply #14 on: October 07, 2010, 01:30:08 PM »

Hi all,
reading this thread, I feel honoured that all this discussion is taking place in the Problems and Support forum moderated by me, but I wonder whether it would be a good idea to add a new forum explicitly for the PaREn extension. What do you think?

@Christian
If you are going to estimate the runtime of an operator, it might be useful to contact us. We have been working on the same issue for a while and can probably provide you with some help on that. It might also be a good idea to join our new Special Interest Group for the Development of RapidMiner. I think you left RCOMM before we established it on the last day - is that possible?

Greetings,
  Sebastian