Author Topic: Stratification: How to get the same number of examples for each class?  (Read 11243 times)
JohnQuest
Newbie
*
Posts: 15


« on: June 10, 2010, 07:49:12 AM »

I have a data set with 2 labels: label A (6000 items) and label B (500 items).
I want to run a 10-fold cross-validation, but with sampling. For example: the 1st fold has 600 examples of label A and 50 of label B. We want to sample 50 examples of label A and create a new 1st fold with 50 of label A and 50 of label B. The same process applies to the other 8 training folds; we then use the 9 sampled folds together for training and 1 fold of non-sampled data for testing. The process loops through the entire data set and collects the performance.

So far I am able to do the above process one fold at a time, which is time-consuming. I was hoping to set up a process that does this automatically.
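
For readers who want to see the procedure spelled out, here is a minimal sketch in plain Python (not RapidMiner). It assumes the goal is: split into stratified folds, undersample only the training part so every class has as many examples as the smallest one, and leave the test fold untouched. The function names (`stratified_folds`, `undersample`, `balanced_cross_validation`) are made up for this illustration, and no learner is trained; the sketch only produces the index splits.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split example indices into k folds with roughly equal class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def undersample(indices, labels, rng):
    """Keep the same number of examples per class (the smallest class size)."""
    by_class = defaultdict(list)
    for index in indices:
        by_class[labels[index]].append(index)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


def balanced_cross_validation(labels, k=10, seed=0):
    """Yield (balanced_training_indices, untouched_test_indices) per fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for test_position, test_indices in enumerate(folds):
        training = [i for position, fold in enumerate(folds)
                    if position != test_position for i in fold]
        yield undersample(training, labels, rng), test_indices


# The class sizes from the question: 6000 of label A, 500 of label B.
labels = ["A"] * 6000 + ["B"] * 500
splits = list(balanced_cross_validation(labels, k=10))
```

Each element of `splits` is a pair of index lists; a learner would be trained on the first and evaluated on the second, and the per-fold performance averaged afterwards.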

Thanks in advance for your support :)

John Quest
haddock
Hero Member
*****
Posts: 853



« Reply #1 on: June 10, 2010, 08:41:25 AM »

Hi,

There is no need to repeat your question. What is the difference between doing what you describe and using standard XValidation with stratified sampling, applied on an example set with 50% label A and 50% label B? If you post your XML people will take more interest.

« Last Edit: June 10, 2010, 09:29:37 AM by haddock »

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
JohnQuest
Newbie
*
Posts: 15


« Reply #2 on: June 10, 2010, 12:46:49 PM »

My setup is as follows. I am wondering how to make the "Sample" operator automatically set its sample size according to the size of the output of the "Filter Examples" operator, the one that uses the parameter setting correctness=correct.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="386" width="681">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="38" y="77">
        <parameter key="repository_entry" value="../data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="75">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="back_freq|back_avg_distance|candidate_len|freq_keyword|snippets|suppE|suppC|keyword_id_ch|correctness|roverd|ranking|dis|lift|front_freq"/>
      </operator>
      <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="75">
        <process expanded="true" height="431" width="373">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="112" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=wrong"/>
          </operator>
          <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="246" y="30">
            <parameter key="sample_size" value="5661"/>
          </operator>
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="165">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="correctness=correct"/>
          </operator>
          <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="246" y="165"/>
          <operator activated="true" class="naive_bayes" expanded="true" height="76" name="Naive Bayes" width="90" x="246" y="300"/>
          <connect from_port="training" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="414" width="373">
          <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="51" y="43">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance_classification" expanded="true" height="76" name="Performance" width="90" x="227" y="44">
            <list key="class_weights"/>
          </operator>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 2"/>
      <connect from_op="Validation" from_port="training" to_port="result 1"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



« Reply #3 on: June 10, 2010, 03:38:06 PM »

Hi,

this is clearly going far beyond the scope of this board (and actually also of this forum). A process like this isn't made within a minute.

However, I have created a process for the desired task and uploaded it with the Community Extension of RapidMiner under the name "Same Number of Examples per Class (Stratification; Loops and Macros)". Just download and install the Community Extension and search for the process (search in this forum for more information; some details can also be found in my signature below).

Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
haddock
Hero Member
*****
Posts: 853



« Reply #4 on: June 10, 2010, 03:45:40 PM »

Greetings O Pointy One,

You beat me to it! Drat! Can we not have a badge/smiley pointing folks there, lest we have to repeat ourselves (this exact question of balancing data comes up repeatedly)?

Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



« Reply #5 on: June 10, 2010, 03:55:00 PM »

I might have been faster, but the solution can still be optimized :D A good idea would be to extract the label automatically without having the user define it via a macro. The second thing is that I lose one example in the minority class :-/

Anyway, I moved the discussion into this board here and also made it sticky so that we can easily link to it in the future.

Cheers,
Ingo
« Last Edit: June 10, 2010, 04:27:23 PM by Ingo Mierswa »
haddock
Hero Member
*****
Posts: 853



« Reply #6 on: June 10, 2010, 04:46:04 PM »

Hi,

I think this covers the points you made. I must say I found the 'Append' operator placement a challenge; still, it does show the world of collections at work.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="335" width="791">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="120">
        <parameter key="repository_entry" value="//Samples/data/Sonar"/>
      </operator>
      <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="120">
        <parameter key="macro" value="exs"/>
      </operator>
      <operator activated="true" class="loop_values" expanded="true" height="76" name="Loop Values" width="90" x="313" y="120">
        <parameter key="attribute" value="class"/>
        <process expanded="true" height="453" width="809">
          <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="141" y="94">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="class=%{loop_value}"/>
          </operator>
          <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro (2)" width="90" x="313" y="75">
            <parameter key="macro" value="subexs"/>
          </operator>
          <operator activated="true" class="generate_macro" expanded="true" height="76" name="Generate Macro" width="90" x="447" y="75">
            <list key="function_descriptions">
              <parameter key="exs" value="min(%{subexs},%{exs})"/>
            </list>
          </operator>
          <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Extract Macro (2)" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
          <connect from_op="Generate Macro" from_port="through 1" to_port="out 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop_collection" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="120">
        <parameter key="unfold" value="true"/>
        <parameter key="parallelize_iteration" value="true"/>
        <process expanded="true" height="353" width="809">
          <operator activated="true" class="sample" expanded="true" height="76" name="Sample" width="90" x="269" y="53">
            <parameter key="sample_size" value="%{exs}"/>
          </operator>
          <connect from_port="single" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" expanded="true" height="76" name="Append" width="90" x="581" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop Values" to_port="example set"/>
      <connect from_op="Loop Values" from_port="out 1" to_op="Loop Collection" to_port="collection"/>
      <connect from_op="Loop Collection" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
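
As a reading aid, the logic of the process above can be sketched in plain Python. This is only my reading of the XML, not a statement of how RapidMiner executes it: the macro `exs` starts as the total number of examples, the loop over class values lowers it to the smallest class size, and each class subset is then sampled down to `exs` before the samples are appended. The function name `balance_to_smallest_class` and the toy data (12 "wrong" and 5 "correct" examples) are invented for the illustration.

```python
import random
from collections import defaultdict


def balance_to_smallest_class(examples, label_key, seed=0):
    """Return a sample with the same number of examples for every class."""
    rng = random.Random(seed)

    # Like "Extract Macro": start with the total number of examples.
    exs = len(examples)

    # Like "Loop Values" + "Filter Examples": one subset per class value.
    subsets = defaultdict(list)
    for example in examples:
        subsets[example[label_key]].append(example)

    # Like "Extract Macro (2)" + "Generate Macro": exs = min(subexs, exs).
    for subset in subsets.values():
        exs = min(len(subset), exs)

    # Like "Loop Collection" + "Sample", followed by "Append":
    # draw exs examples from each subset and merge the samples.
    balanced = []
    for subset in subsets.values():
        balanced.extend(rng.sample(subset, exs))
    return balanced


# Toy data with made-up class sizes: 12 "wrong" and 5 "correct" examples.
examples = ([{"correctness": "wrong", "id": i} for i in range(12)]
            + [{"correctness": "correct", "id": i} for i in range(5)])
balanced = balance_to_smallest_class(examples, "correctness")
```

With this toy input, `balanced` holds 5 examples of each class. Because the sample size is derived from the data, nothing has to be typed in by hand when the class sizes change.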


JohnQuest
Newbie
*
Posts: 15


« Reply #7 on: June 17, 2010, 11:05:02 AM »

Thanks, I will try it out

John
JohnQuest
Newbie
*
Posts: 15


« Reply #8 on: June 22, 2010, 04:09:23 AM »

Dear All,
I am still having some problems understanding the last XML posted by haddock; I cannot connect the macros to two outputs.
My question still concerns my XML post of 10 June. I have made it simpler and am looking only at the problem this time; please see the XML code below.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="396" width="779">
      <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="//Project CE/cep8/data talbe/157000_85"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples (2)" width="90" x="179" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=wrong"/>
      </operator>
      <operator activated="true" class="filter_examples" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="165">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="correctness=correct"/>
      </operator>
      <operator activated="true" class="sample_stratified" expanded="true" height="76" name="Sample (Stratified)" width="90" x="380" y="30">
        <parameter key="sample_size" value="1662"/>
      </operator>
      <operator activated="true" class="append" expanded="true" height="94" name="Append" width="90" x="514" y="120"/>
      <connect from_op="Retrieve" from_port="output" to_op="Filter Examples (2)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
      <connect from_op="Filter Examples (2)" from_port="original" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Sample (Stratified)" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
We want the "Sample (Stratified)" operator to take exactly as many examples as come out of the "Filter Examples" operator with parameter_string value="correctness=correct". Any ideas? Thanks in advance for your support.


John
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



« Reply #9 on: June 22, 2010, 10:15:51 AM »

Did you try the process I have uploaded with the Community Extension? Could help here...

Cheers,
Ingo
JohnQuest
Newbie
*
Posts: 15


« Reply #10 on: June 24, 2010, 05:45:19 AM »

Dear Ingo,
Sorry for this question: how do I access the files uploaded in the Community Extension? Thanks.

Best regards

John
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



« Reply #11 on: June 24, 2010, 09:34:17 AM »

Hi,

no problem. You can find some explanations here in the forum:


The bottom line is: you can simply download and install our Community Extension via the Update and Installation option in our Help menu and then activate the "myExperiment Browser" in the View menu of RapidMiner. In this view, you can search for the process named above and download it directly into RapidMiner with a single click on "Open".

Cheers,
Ingo
JohnQuest
Newbie
*
Posts: 15


« Reply #12 on: June 30, 2010, 11:57:14 AM »

Dear Ingo Mierswa,
Thanks, and sorry for the late reply. Sometimes it is difficult to come back to my posts: besides "show new replies", the only way I can find my posts is from my profile. Could you tell me another way? Thanks.

I found your process named "same number of examples per class", but I cannot understand what "extract macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support.

John Quest

               
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1226



« Reply #13 on: June 30, 2010, 05:42:46 PM »

Dear John Quest,

(I thought we were already at the stage of using "John" and "Ingo" ;) )

Quote
I found your process named "same number of examples per class", but I cannot understand what "extract macro" and "loop process" do, since there is no output after "loop process". Thanks in advance for your support.

What exactly do you not understand? The first "Loop Values" operator is used only for calculating the size of the smallest class and storing this size in a macro.

Cheers,
Ingo (Mierswa ;) )
JohnQuest
Newbie
*
Posts: 15


« Reply #14 on: July 02, 2010, 07:14:35 AM »

Dear Ingo

Thanks. I may modify it into something more interesting and upload it to the community. I may need your help if I run into problems; thanks in advance for your support.

Best Regards

John