Author Topic: Discretize by Entropy not working properly?  (Read 2012 times)
miguelbiron
Newbie
*
Posts: 1


« on: November 07, 2012, 09:41:59 PM »

Hello,

I'm running some experiments to test the capabilities of this software, which looks really good for data mining tasks, and I've hit a problem with the "Discretize by Entropy" operator. When I apply it to the Iris data set, the two most powerful features, namely "Petal Length" and "Petal Width" (called "a3" and "a4" in the sample data that ships with RapidMiner), get removed by the operator as "useless attributes". This makes no sense (or I'm really missing something), since those attributes are picked by any attribute selection method, and when I use the "Decision Tree" operator, as I did here, they are the only ones used in the resulting tree.

I searched the forum and Google but couldn't find an answer. Interestingly, Weka has a similar procedure called "Discretize", and it works great, but sadly it is not included in RapidMiner's implementation.

Thanks, and sorry for my poor English...

P.S.: This is the XML code of the process I'm experimenting with:

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="386" width="614">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="94" name="Multiply" width="90" x="179" y="120"/>
      <operator activated="true" class="decision_tree" compatibility="5.2.008" expanded="true" height="76" name="Decision Tree" width="90" x="375" y="155"/>
      <operator activated="true" class="discretize_by_entropy" compatibility="5.2.008" expanded="true" height="94" name="Discretize" width="90" x="380" y="30">
        <parameter key="attributes" value="lapiz|peo|"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_port="result 3"/>
      <connect from_op="Discretize" from_port="example set output" to_port="result 2"/>
      <connect from_op="Discretize" from_port="original" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
SebastianBerlin
Newbie
*
Posts: 1


« Reply #1 on: December 17, 2012, 03:29:43 PM »

Hello,

I just started using RapidMiner after several years of working with Weka. I am experiencing the same problem with the entropy-based discretization.

Since the entropy-based discretization of Fayyad and Irani is extremely helpful for learners such as NB or J48, it would be nice if this problem were fixed or, at least, the Weka discretization were included.

Cheers,
Sebastian
Marius
Administrator
Hero Member
*****
Posts: 1794



« Reply #2 on: December 18, 2012, 02:31:51 PM »

Hi,

you are right, something seems to be wrong. I created an internal issue for this operator. Thanks for reporting!

Best regards,
Marius
Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
Mario Hofmann
Newbie
*
Posts: 9


« Reply #3 on: December 18, 2012, 03:33:07 PM »

Today I compared the results of this operator with those of a similar operator in SPSS. Most attributes were handled very similarly, but RapidMiner (or SPSS) was often off by one. I could not see a clear pattern, but there might be differences in how values are rounded or in the granularity at which they are handled.

Regards,

Mario
Marius
Administrator
Hero Member
*****
Posts: 1794



« Reply #4 on: December 18, 2012, 03:35:22 PM »

The problem seems to occur on very pure split points (e.g. in Iris on a3). In that case the calculation would include the logarithm of 0, which is undefined and needs some special handling.
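
To illustrate the special handling mentioned above, here is a minimal Python sketch. It is not the RapidMiner code; the function names and the class counts are made up for illustration. It uses the usual convention that an empty class contributes 0 to the entropy (0 * log(0) is treated as 0), so a perfectly pure partition gets entropy 0 instead of an undefined value. The counts assume a split that separates one of the three Iris classes (50 examples each) completely from the other two.

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    result = 0.0
    for count in class_counts:
        if count > 0:  # skip empty classes: 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(left_counts, right_counts):
    """Entropy reduction obtained by splitting into a left and a right part."""
    parent_counts = [l + r for l, r in zip(left_counts, right_counts)]
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    weighted = (n_left / n) * entropy(left_counts) \
             + (n_right / n) * entropy(right_counts)
    return entropy(parent_counts) - weighted

# Hypothetical pure split: one class fully on the left, the other two on the right.
left = [50, 0, 0]
right = [0, 50, 50]
gain = information_gain(left, right)
```

Without the `count > 0` check, `math.log2(0)` raises a ValueError in Python. An implementation that swallows or mishandles that case could end up treating a very good split as unusable, which would match the symptom reported in this thread.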
Simon Fischer
Administrator
Sr. Member
*****
Posts: 448



« Reply #5 on: September 24, 2013, 08:50:47 AM »

The problem has been fixed and will be part of the next release.
Simon Fischer, Rapid-I
RapidMiner Development on Twitter: @simon_fis