Author Topic: How to do Y-randomization in Rapidminer?  (Read 2902 times)
pengie
Newbie
*
Posts: 21


« on: October 24, 2008, 05:13:51 AM »

Hi,

How can I do Y-randomization in RapidMiner? In Y-randomization, the y value of each example is randomly exchanged with the y value of another example. It is used to validate QSAR models: the performance of the original model (r²) is compared with that of models built on a permuted (randomly shuffled) response.
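
For readers unfamiliar with the idea, the shuffling step described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration: the function name `y_randomize`, the column key `"y"`, and the toy rows are assumptions and not part of any RapidMiner API.

```python
import random

def y_randomize(rows, label_key="y", seed=None):
    """Return copies of the rows with the label column randomly permuted.

    The feature values stay attached to their rows; only the labels move.
    """
    rng = random.Random(seed)
    labels = [row[label_key] for row in rows]
    rng.shuffle(labels)  # in-place random permutation of the label list
    return [{**row, label_key: new} for row, new in zip(rows, labels)]

# Hypothetical toy data: one feature "x" and a response "y".
data = [
    {"x": 0.1, "y": "a"},
    {"x": 0.4, "y": "b"},
    {"x": 0.7, "y": "a"},
    {"x": 0.9, "y": "c"},
]
shuffled = y_randomize(data, seed=42)
```

The permuted set contains exactly the same labels as the original, so the class distribution is preserved while any real relationship between features and response is broken.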

Regards
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #1 on: October 27, 2008, 10:16:48 AM »

Hi,
Although there is no operator for Y-randomization in RapidMiner yet, we can make use of its modularity. I have created a process that does Y-randomization. You could encapsulate it in an OperatorChain to use it within your own process.

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="one third classification"/>
    </operator>
    <operator name="IdTagging" class="IdTagging">
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
        <parameter key="attribute_name_regex" value="label|id"/>
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="keep_subset_only" value="true"/>
        <operator name="NoiseGenerator" class="NoiseGenerator">
            <parameter key="label_noise" value="0.0"/>
            <list key="noise">
            </list>
            <parameter key="random_attributes" value="1"/>
        </operator>
        <operator name="Sorting" class="Sorting">
            <parameter key="attribute_name" value="random"/>
        </operator>
        <operator name="IdTagging (2)" class="IdTagging">
        </operator>
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object" value="ExampleSet"/>
        <parameter key="select_which" value="2"/>
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
    <operator name="AttributeFilter (2)" class="AttributeFilter">
        <parameter key="condition_class" value="attribute_name_filter"/>
        <parameter key="invert_filter" value="true"/>
        <parameter key="parameter_string" value="random"/>
    </operator>
</operator>
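
The process above appears to work by tagging ids, keeping only label and id on a copy, sorting that copy by a random attribute, tagging new ids in the sorted order, and joining the copy back onto the original by id. A rough Python sketch of the same trick follows. The function name, the dictionary layout, and the way the duplicate label is resolved in the join (the shuffled label replaces the original) are assumptions made for illustration, not a description of RapidMiner's internals.

```python
import random

def shuffle_labels_by_join(rows, seed=None):
    """Permute labels by sorting on a random key and re-joining on id."""
    rng = random.Random(seed)
    # IdTagging: number the examples 1..n.
    tagged = [{**row, "id": i} for i, row in enumerate(rows, start=1)]
    # Copy reduced to the label, plus one random value per example
    # (the role played by NoiseGenerator's "random" attribute).
    keyed = [(rng.random(), row["label"]) for row in tagged]
    # Sorting: order the copy by the random value.
    keyed.sort(key=lambda pair: pair[0])
    # IdTagging (2): assign fresh ids 1..n in the new order.
    label_by_new_id = {i: label for i, (_, label) in enumerate(keyed, start=1)}
    # ExampleSetJoin: match on id; the random value itself is discarded.
    return [{**row, "label": label_by_new_id[row["id"]]} for row in tagged]

# Hypothetical toy examples.
examples = [
    {"att1": 1.0, "label": "pos"},
    {"att1": 2.0, "label": "neg"},
    {"att1": 3.0, "label": "pos"},
    {"att1": 4.0, "label": "neg"},
]
joined = shuffle_labels_by_join(examples, seed=1)
```

Because the second id tagging follows the random order, joining on id pairs each original example with the label of an arbitrary other example.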

Hope that helps.


Greetings,
  Sebastian
pengie
Newbie
*
Posts: 21


« Reply #2 on: October 28, 2008, 05:46:53 AM »

Hi,

Thank you for your help. The code worked perfectly. I am now trying to use RapidMiner to do Y-randomization, train a model, evaluate it with leave-one-out cross-validation, and repeat this 100 times to get an average classification error for the Y-randomized data. I am using the following code:

Code:
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="-1"/>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="one third classification"/>
    </operator>
    <operator name="RepeatUntilOperatorChain" class="RepeatUntilOperatorChain" expanded="yes">
        <parameter key="max_iterations" value="100"/>
        <operator name="IdTagging" class="IdTagging">
        </operator>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="no">
            <parameter key="attribute_name_regex" value="label|id"/>
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="keep_subset_only" value="true"/>
            <operator name="NoiseGenerator" class="NoiseGenerator">
                <parameter key="label_noise" value="0.0"/>
                <list key="noise">
                </list>
                <parameter key="random_attributes" value="1"/>
            </operator>
            <operator name="Sorting" class="Sorting">
                <parameter key="attribute_name" value="random"/>
            </operator>
            <operator name="IdTagging (2)" class="IdTagging">
            </operator>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="AttributeFilter (2)" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="parameter_string" value="random"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="leave_one_out" value="true"/>
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="3"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <list key="class_weights">
                    </list>
                    <parameter key="classification_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>

However, RapidMiner reports an error about the RepeatUntilOperatorChain operator.
Tobias Malbrecht
Global Moderator
Sr. Member
*****
Posts: 293



« Reply #3 on: October 28, 2008, 10:15:48 AM »

Hi,

Just a hint: why not use the IteratingPerformanceAverage operator? It also iterates a predefined number of times and averages the performance vectors resulting from the inner operator chain.
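
The iterate-and-average pattern can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not RapidMiner code: the function names, the toy data, and the hand-written 3-nearest-neighbour classifier with squared Euclidean distance are all made up for the example.

```python
import random
from collections import Counter

def loo_knn_error(points, labels, k=3):
    """Leave-one-out classification error of a k-nearest-neighbour classifier."""
    errors = 0
    for i, query in enumerate(points):
        # Squared distances from the held-out example to all other examples.
        others = [
            (sum((a - b) ** 2 for a, b in zip(query, p)), labels[j])
            for j, p in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in others[:k])
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / len(points)

def mean_y_randomized_error(points, labels, iterations=100, seed=0):
    """Average the leave-one-out error over repeated label permutations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        permuted = labels[:]
        rng.shuffle(permuted)  # fresh Y-randomization in every iteration
        total += loo_knn_error(points, permuted)
    return total / iterations

# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.05, 5.15)]
labels = ["a"] * 4 + ["b"] * 4

original_error = loo_knn_error(points, labels)
randomized_error = mean_y_randomized_error(points, labels)
```

On data like this, the error on the original labels should be clearly lower than the averaged error on permuted labels; if the two are close, the original model is probably not capturing a real relationship.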

Regards,
Tobias

Tobias Malbrecht
Director of Product Marketing
RapidMiner
pengie
Newbie
*
Posts: 21


« Reply #4 on: October 30, 2008, 08:10:45 AM »

Great hint!

I ran into another error: "Message: The attribute 'random' does not exist." I did a bit of tracing. It seems that AttributeFilter (2) removes the attribute 'random' after the first round, but in the second round the NoiseGenerator generates an attribute named 'random1' instead of 'random', which causes the error.
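
One way such a symptom can arise is sketched below in Python. It is a hypothetical model only: the `NameGenerator` class and `run_iteration` function are invented for illustration, and whether RapidMiner tracks attribute names this way is an assumption.

```python
class NameGenerator:
    """Hypothetical stand-in that never reuses a name it has already issued."""

    def __init__(self):
        self._used = set()

    def fresh(self, base):
        name, n = base, 0
        while name in self._used:
            n += 1
            name = f"{base}{n}"
        self._used.add(name)
        return name

def run_iteration(names, sort_key="random"):
    columns = {"label", "id"}
    columns.add(names.fresh("random"))  # noise step adds a new attribute
    if sort_key not in columns:         # sorting step uses a hard-coded name
        raise KeyError(f"The attribute '{sort_key}' does not exist.")
    columns.discard(sort_key)           # filter step removes the attribute again
    return columns

names = NameGenerator()
first = run_iteration(names)            # first round: attribute is 'random'
try:
    run_iteration(names)                # second round: attribute is 'random1'
    second_error = None
except KeyError as exc:
    second_error = exc.args[0]
```

If the name bookkeeping outlives the attribute itself, any later step that refers to the fixed name 'random' fails from the second iteration onward.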

Code:
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="-1"/>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="one third classification"/>
    </operator>
    <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
        <operator name="IdTagging" class="IdTagging">
        </operator>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="attribute_name_regex" value="label|id"/>
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="keep_subset_only" value="true"/>
            <operator name="NoiseGenerator" class="NoiseGenerator" breakpoints="after">
                <parameter key="label_noise" value="0.0"/>
                <list key="noise">
                </list>
                <parameter key="random_attributes" value="1"/>
            </operator>
            <operator name="Sorting" class="Sorting">
                <parameter key="attribute_name" value="random"/>
            </operator>
            <operator name="IdTagging (2)" class="IdTagging">
            </operator>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="AttributeFilter (2)" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="parameter_string" value="random"/>
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="leave_one_out" value="true"/>
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="3"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="no">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <list key="class_weights">
                    </list>
                    <parameter key="classification_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #5 on: October 31, 2008, 11:54:45 AM »

Hi,
Try our Permutation operator. I forgot about it myself in the previous solution. So many operators... :)

Code:
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="-1"/>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="target_function" value="one third classification"/>
    </operator>
    <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
        <operator name="IdTagging" class="IdTagging">
        </operator>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="attribute_name_regex" value="label|id"/>
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="keep_subset_only" value="true"/>
            <operator name="Permutation" class="Permutation">
            </operator>
            <operator name="IdTagging (2)" class="IdTagging">
            </operator>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="leave_one_out" value="true"/>
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="3"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="no">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <list key="class_weights">
                    </list>
                    <parameter key="classification_error" value="true"/>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>


This should help.

Greetings,
  Sebastian
pengie
Newbie
*
Posts: 21


« Reply #6 on: November 03, 2008, 02:13:39 AM »

Thank you so much. It worked perfectly.

Just one last question: when I set a breakpoint at ExampleSetJoin, I noticed that the id number of the dataset keeps increasing. Why is that, and will it have any impact on memory?
« Last Edit: November 03, 2008, 02:59:23 AM by pengie »
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #7 on: November 03, 2008, 12:54:40 PM »

Hi,
No, this won't increase memory consumption. The memory of an ExampleSet is freed once no ExampleSet refers to it any more. Keep in mind that it does not have to be freed immediately: Java frees memory when it considers that appropriate or needs the space.

Greetings,
  Sebastian