Topic: Creating SVM learning sets (Read 1801 times)
Gil
Guest
« on: June 23, 2008, 07:05:03 PM »

I think I initially put this message in the wrong category, so here it is again:
Hi,

I've been trying to apply an SVM to a batch of text documents in order to evaluate the performance of a model I developed as part of my thesis. First I used the 01_TextClassificationXVal.xml example found in the text plugin documentation. The XML of this example is given below (I deleted some of the text processing operators, which are irrelevant to my question, to make it shorter):

<operator name="Root" class="Process" expanded="yes">
    <description text="#ylt#h3#ygt#Optimizing vector creation for text classification#ylt#/h3#ygt##ylt#p#ygt#This experiments shows how to apply a cross validation to a classifier that learns to separate two sets of texts.#ylt#/p#ygt#"/>
    <operator name="TextInput" class="TextInput" expanded="yes">
        <parameter key="create_text_visualizer"   value="true"/>
        <list key="namespaces">
        </list>
        <parameter key="prune_below"   value="3"/>
        <list key="texts">
          <parameter key="graphics"   value="../data/newsgroup/graphics"/>
          <parameter key="hardware"   value="../data/newsgroup/hardware"/>
        </list>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="leave_one_out"   value="true"/>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <list key="class_weights">
            </list>
            <parameter key="kernel_type"   value="linear"/>
            <parameter key="shrinking"   value="false"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
                <parameter key="AUC"   value="true"/>
                <parameter key="f_measure"   value="true"/>
            </operator>
        </operator>
    </operator>
</operator>

The problem I have with this example is that the smallest learning set I can use is half of the entire dataset (if I set the number of cross-validation folds to 2). I would like to use only a tenth of the dataset for learning, as the dataset is quite large. Is there an operator that can do that for me?

Thanks in advance,
Gil
Sebastian Land
Administrator
Hero Member
Posts: 2426


« Reply #1 on: June 23, 2008, 10:34:53 PM »

Hi Gil,
what about sampling your data? If I got you right, you don't want to use all your examples for learning. Perhaps you could use a sampling operator to discard that portion of the data?

Greetings,
   Sebastian
Gil
Guest
« Reply #2 on: June 24, 2008, 05:48:13 AM »

Hi Land,

Thanks for answering so quickly.

You are right: I want to use only a small part of my set for learning, a much smaller part than what cross-validation offers. However, I don't know how to apply a sampling algorithm with a TextInput operator. Would it be possible for you (or anyone else, for that matter) to post an example of how to do this?

In an attempt to approach this problem from a different direction, I wrote some Java code that goes over all the documents in my dataset and randomly creates subsets, which I intended to use as learning sets. I then wrote two simple experiments: one that creates a model based on the subsets I created, and another that loads that model and applies it to the entire dataset.

To make sure these two experiments function properly, I used half the dataset as the learning set (I thought this way I could compare my results to those produced by a 2-fold cross validation). Sadly, the results I got were much poorer than those produced by the cross-validation experiment, and I can't understand why that is the case. The XML of the two experiments is posted below. If I made a mistake, please help me understand what it is.

If someone could help me solve even one of these two problems, I think that would be all I need.

Thanks in advance,
Gil

The two experiments:
1) The learning phase - creating the SVM model:


<?xml version="1.0" encoding="windows-1252"?>
<process version="4.1">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="TextInput" class="TextInput" expanded="no">
          <parameter key="create_text_visualizer"   value="true"/>
          <list key="namespaces">
          </list>
          <parameter key="prune_below"   value="3"/>
          <list key="texts">
            <parameter key="type1"   value="D:\exp\type1_learnign_set"/>
            <parameter key="type2"   value="D:\exp\type2_learnign_set"/>
          </list>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars"   value="3"/>
          </operator>
          <operator name="PorterStemmer" class="PorterStemmer">
          </operator>
          <operator name="TermNGramGenerator" class="TermNGramGenerator">
          </operator>
      </operator>
      <operator name="LibSVMLearner" class="LibSVMLearner">
          <list key="class_weights">
          </list>
          <parameter key="kernel_type"   value="linear"/>
      </operator>
      <operator name="ModelWriter" class="ModelWriter">
          <parameter key="model_file"   value="C:\Documents and Settings\Admin\Desktop\SVM_Model.mod"/>
      </operator>
  </operator>

</process>

2) The test phase - applying the model:

<?xml version="1.0" encoding="windows-1252"?>
<process version="4.1">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="TextInput" class="TextInput" expanded="no">
          <parameter key="create_text_visualizer"   value="true"/>
          <list key="namespaces">
          </list>
          <parameter key="prune_below"   value="3"/>
          <list key="texts">
            <parameter key="type1"   value="D:\exp\type1_full_set"/>
            <parameter key="type2"   value="D:\exp\type2_full_set"/>
          </list>
          <operator name="StringTokenizer" class="StringTokenizer">
          </operator>
          <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
          </operator>
          <operator name="TokenLengthFilter" class="TokenLengthFilter">
              <parameter key="min_chars"   value="3"/>
          </operator>
          <operator name="PorterStemmer" class="PorterStemmer">
          </operator>
          <operator name="TermNGramGenerator" class="TermNGramGenerator">
          </operator>
      </operator>
      <operator name="ModelLoader" class="ModelLoader">
          <parameter key="model_file"   value="C:\Documents and Settings\Admin\Desktop\SVM_Model.mod"/>
      </operator>
      <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
      </operator>
      <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
          <parameter key="AUC"   value="true"/>
          <parameter key="f_measure"   value="true"/>
      </operator>
  </operator>

</process>


Tobias Malbrecht
Global Moderator
Sr. Member
Posts: 293



« Reply #3 on: June 24, 2008, 01:45:05 PM »

Hi Gil,

Well, there is no direct and easy way to run a cross validation that uses, say, only 10% of the examples for training and the other 90% for testing. The easiest option is to apply a sampling operator (e.g. StratifiedSampling) before the cross validation. That way you can discard, say, about 50% of your data and do a "normal" cross validation on the remaining 50%.
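
The sampling option could look roughly like the sketch below. It reuses the operators from the first listing in this thread. The `fraction` parameter of StratifiedSampling and the `number_of_validations` parameter of XValidation are assumptions written from memory of RapidMiner 4.x, so check them against your installed version. The text processing operators are omitted, as in the original listing.

```xml
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <parameter key="prune_below" value="3"/>
        <list key="texts">
          <parameter key="graphics" value="../data/newsgroup/graphics"/>
          <parameter key="hardware" value="../data/newsgroup/hardware"/>
        </list>
    </operator>
    <!-- Assumed parameter name: keep about 50% of the examples,
         preserving the class proportions, and discard the rest. -->
    <operator name="StratifiedSampling" class="StratifiedSampling">
        <parameter key="fraction" value="0.5"/>
    </operator>
    <!-- "Normal" cross validation on the remaining examples. -->
    <operator name="XValidation" class="XValidation" expanded="yes">
        <!-- Assumed parameter name for the number of folds. -->
        <parameter key="number_of_validations" value="10"/>
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <list key="class_weights">
            </list>
            <parameter key="kernel_type" value="linear"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="BinominalClassificationPerformance" class="BinominalClassificationPerformance">
                <parameter key="AUC" value="true"/>
                <parameter key="f_measure" value="true"/>
            </operator>
        </operator>
    </operator>
</operator>
```

Note that this shrinks the data seen by the whole validation. Within each fold, the learner still trains on all but one fold of the sampled data.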

Otherwise, you can approximate a kind of repeated validation with the following process:

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="NominalExampleSetGenerator" class="NominalExampleSetGenerator">
    </operator>
    <operator name="ParameterIteration" class="ParameterIteration" expanded="yes">
        <parameter key="keep_output" value="true"/>
        <list key="parameters">
          <parameter key="SimpleValidation.local_random_seed" value="1,2,3,4,5,6,7,8,9,10"/>
        </list>
        <operator name="SimpleValidation" class="SimpleValidation" expanded="yes">
            <parameter key="local_random_seed" value="10"/>
            <operator name="NaiveBayes" class="NaiveBayes">
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="Performance" class="Performance">
                </operator>
            </operator>
        </operator>
    </operator>
    <operator name="AverageBuilder" class="AverageBuilder">
    </operator>
</operator>

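Note, however, that the examples are not partitioned across the iterations: each iteration draws its own random split, so the test sets of different iterations can overlap.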
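
The process above sets only the random seed on SimpleValidation, so the size of the training portion is left at the operator's default. A sketch of how the 10% training / 90% testing split from the original question might be requested follows. It assumes that `split_ratio` is the parameter holding the relative size of the training set in this RapidMiner version, which should be verified in the operator reference.

```xml
<operator name="SimpleValidation" class="SimpleValidation" expanded="yes">
    <!-- Assumed parameter name: 0.1 means 10% of the examples
         are used for training and the other 90% for testing. -->
    <parameter key="split_ratio" value="0.1"/>
    <!-- Overwritten by ParameterIteration in each iteration. -->
    <parameter key="local_random_seed" value="10"/>
    <operator name="NaiveBayes" class="NaiveBayes">
    </operator>
    <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
        <operator name="Performance" class="Performance">
        </operator>
    </operator>
</operator>
```

This fragment would replace the SimpleValidation operator inside the ParameterIteration. The learner can be swapped for LibSVMLearner in the same position.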

Regards,
Tobias
« Last Edit: June 24, 2008, 01:49:35 PM by Tobias Malbrecht »

Tobias Malbrecht
Director of Product Marketing
RapidMiner