Danyo83
Newbie

Posts: 16
« on: January 15, 2012, 10:06:38 PM »
Hi,
I have a classification problem with 2 classes. Unfortunately, one cannot access the prediction label when using a Feature Selection process. So I saved the attribute weights and started a new process with the loaded weights. I applied the model to the "unseen" test set and compared its performance with the performance of the FS process, which used the same weights. The performance of the applied model differs a lot from the test set performance of the FS process. Can you fix this bug and maybe offer a way to access the prediction label directly from the simple FS process? Furthermore, I have to report that when saving the model to an XML file or similar and reloading it, the performance also differs a lot from the FS process performance. Can you fix it?
Thanks in advance, Daniel
« Last Edit: February 28, 2012, 12:12:20 PM by Danyo83 »
Marius
« Reply #1 on: January 16, 2012, 01:53:31 PM »
Hi Daniel,
can you please describe in detail what you are doing with the loaded weights, and how you are performing the Feature Selection? Most useful would be example processes.
Best, Marius
Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
Danyo83
Newbie

Posts: 16
« Reply #2 on: January 30, 2012, 09:25:10 PM »
Hi Marius, thank you very much for your reply. Unfortunately I only just found out that you had written to me; maybe you could offer an automatic notification mail. Anyway, I have a Feature Selection process (Split Validation with linear sampling). I need the prediction labels for the test set, but unfortunately this is currently not possible with RM. So I use the "save model", "load model" and "apply model" operators and run the process again on the test set only, in order to get the predictions which I need for further processes. The problem is that the model is not at all the same as the one I saved before. The classification accuracies differ a lot, although the test set in the FS process and the test set the loaded model is applied to are identical. It's the same problem as here: http://rapid-i.com/rapidforum/index.php/topic,3438.msg16533.html#msg16533 Can I send you my process? Thanks in advance, and again sorry for the late reply. Daniel
Danyo83
Newbie

Posts: 16
« Reply #3 on: January 31, 2012, 09:51:12 AM »
I forgot something. Since the process via save and load model did not work, I built another process using the save and load attribute weights operators. The attribute weights are saved after the FS process and loaded when running the split validation with the same classifier. So the accuracy should be the same as on the test set of the FS process. But still, the accuracy is not exactly the same as on the test set of the FS process, only at least similar. It is hard to describe. I would appreciate being able to send you both processes.
Thanks in advance
Daniel
Marius
« Reply #4 on: January 31, 2012, 07:46:52 PM »
Hi Daniel,
you can post your processes here in the forum. Just open the process in RapidMiner, go to the XML tab on top of the process view and copy the xml code into your post, surrounding it with code tags via the "#" button above the input field here in the forum.
Best, Marius
Danyo83
Newbie

Posts: 16
« Reply #5 on: February 03, 2012, 08:35:18 PM »
Hi Marius, this is the code. Instead of the "store (model)" and "recall (model)" operators, one can also use "write model" and "load model". Since I cannot directly access the prediction label (for the test set) in the Feature Selection process, I need to save the model built in the FS process in order to load it and apply it to the identical test data. Since I cannot see the predicted label and the performance evaluation at the same time, I need to do this again, but this time with a performance evaluation operator at the end, to be able to compare the performance results of the FS process with those of the built and applied model. Actually the performance should be the same, since the test set data is identical. But the results differ without any reason. I have checked it a hundred times. Do you have an explanation? P.S. Since the maximum number of characters was reached, I deleted some features in the code, but this shouldn't be a problem...
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.1.017"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. 
Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. 
<table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="995" width="846"> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="179" y="30"> <parameter key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="keep_best" value="3"/> <parameter key="normalize_weights" value="false"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="604" width="748"> <operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="naive_bayes" compatibility="5.1.017" expanded="true" 
height="76" name="Naive Bayes" width="90" x="148" y="30"/> <connect from_port="training" to_op="Naive Bayes" to_port="training set"/> <connect from_op="Naive Bayes" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="346"> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Applier" to_port="model"/> <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <operator activated="true" class="remember" compatibility="5.1.017" expanded="true" height="60" name="Remember_Model" width="90" x="313" y="120"> <parameter key="name" value="Model_new"/> <parameter key="io_object" value="Model"/> </operator> <operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="514" y="30"> <list 
key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.FS.value.performance"/> </list> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="model" to_op="Remember_Model" to_port="store"/> <connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="write_weights" compatibility="5.1.017" expanded="true" height="60" name="Write Weights" width="90" x="514" y="120"> <parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/> </operator> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV (2)" width="90" x="246" y="300"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="268" value="a268.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="recall" compatibility="5.1.017" expanded="true" height="60" name="Recall (2)_Model" width="90" x="246" y="210"> <parameter key="name" value="Model_new"/> <parameter 
key="io_object" value="Model"/> <parameter key="remove_from_store" value="false"/> </operator> <operator activated="true" class="read_weights" compatibility="5.1.017" expanded="true" height="60" name="AttributeWeightsLoader (3)" width="90" x="380" y="345"> <parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/> </operator> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="514" y="300"/> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model" width="90" x="648" y="255"> <list key="application_parameters"/> </operator> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV (3)" width="90" x="246" y="615"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="2" value="a2.true.real.attribute"/> <parameter key="3" value="a3.true.real.attribute"/> <parameter key="268" value="a268.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="recall" compatibility="5.1.017" expanded="true" height="60" name="Recall (3)_model" width="90" x="246" y="525"> <parameter key="name" value="Model_new"/> <parameter key="io_object" value="Model"/> <parameter key="remove_from_store" value="false"/> </operator> <operator 
activated="true" class="read_weights" compatibility="5.1.017" expanded="true" height="60" name="AttributeWeightsLoader (2)" width="90" x="380" y="705"> <parameter key="attribute_weights_file" value="C:\Users\Node\daniel_att_weights.wgt"/> </operator> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="447" y="570"/> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="581" y="525"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="715" y="480"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="FS" to_port="example set in"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="Write Weights" to_port="input"/> <connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="Write Weights" from_port="through" to_port="result 5"/> <connect from_op="Read CSV (2)" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/> <connect from_op="Recall (2)_Model" from_port="result" to_op="Apply Model" to_port="model"/> <connect from_op="AttributeWeightsLoader (3)" from_port="output" to_op="AttributeWeightSelection (2)" to_port="weights"/> <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/> <connect from_op="Apply Model" from_port="labelled data" to_port="result 3"/> <connect from_op="Read CSV (3)" from_port="output" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect from_op="Recall 
(3)_model" from_port="result" to_op="Apply Model (2)" to_port="model"/> <connect from_op="AttributeWeightsLoader (2)" from_port="output" to_op="AttributeWeightSelection (3)" to_port="weights"/> <connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> <portSpacing port="sink_result 6" spacing="0"/> </process> </operator> </process>
Marius
« Reply #6 on: February 06, 2012, 11:07:26 AM »
How much does the performance differ? Because one run uses a Validation operator and the other does not, the performance will differ a bit, but it should be of the same order of magnitude. Btw, you can simplify your process a bit (see below). Best, Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. 
Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. <table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="995" width="846"> <operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="recall" compatibility="5.2.000" expanded="true" height="60" name="Recall (3)_model" width="90" x="45" y="210"> <parameter key="name" value="Model_new"/> <parameter key="io_object" value="Model"/> <parameter key="remove_from_store" value="false"/> </operator> <operator activated="true" class="multiply" compatibility="5.2.000" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/> <operator activated="true" class="optimize_selection" compatibility="5.2.000" expanded="true" height="94" name="FS" width="90" x="514" y="30"> <parameter 
key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="keep_best" value="3"/> <parameter key="normalize_weights" value="false"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="604" width="748"> <operator activated="true" class="split_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="naive_bayes" compatibility="5.2.000" expanded="true" height="76" name="Naive Bayes" width="90" x="148" y="30"/> <connect from_port="training" to_op="Naive Bayes" to_port="training set"/> <connect from_op="Naive Bayes" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="346"> <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Applier" to_port="model"/> <connect from_port="test set" to_op="Applier" 
to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <operator activated="true" class="remember" compatibility="5.2.000" expanded="true" height="60" name="Remember_Model" width="90" x="313" y="120"> <parameter key="name" value="Model_new"/> <parameter key="io_object" value="Model"/> </operator> <operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="ProcessLog" width="90" x="514" y="30"> <list key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.FS.value.performance"/> </list> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="model" to_op="Remember_Model" to_port="store"/> <connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="380" y="255"/> <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" 
compatibility="5.2.000" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Recall (3)_model" from_port="result" to_op="Apply Model (2)" to_port="model"/> <connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/> <connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/> <connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
Danyo83
Newbie

Posts: 16
« Reply #7 on: February 09, 2012, 12:32:53 AM »
Hi Marius,
thanks a lot. This really helps to make it easier. I only needed to add another CSV reader (and disable the Multiply operator), since the model should only be applied to the test set, not to the whole dataset, which also includes the training data...
I thought that, since the test set is identical to the set the model is applied to, the performance should not differ, right? The model is built after the validation process. How can it be that the test set is not classified identically? The accuracy sometimes differs by only 3 percentage points (67% vs. 64%), but sometimes by 22 (68% vs. 46%). The latter is the case when the validation process runs for a long time even though the performance has not improved for a long time. The strange thing is that the applied model predicted every data point into the same class, never into the other one (it is a 2-class case). That is why its accuracy is only 46%, while the accuracy on the test set of the forward selection process is 68%. It is really annoying that I cannot fix it.
Can you help me?
Thanks in advance
Daniel
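One way such a gap can arise is sketched below in Python. This is only an illustration under assumptions, not RapidMiner code: it assumes that an object stored inside the selection loop is overwritten on every candidate evaluation, so the stored model belongs to the last candidate rather than the best one. The candidate list and the scores are made-up values, and `store` is a hypothetical stand-in for a Remember-style object store.

```python
# Hypothetical stand-in for a wrapper feature selection loop.
# Scores are made-up illustration values, not results from this thread.
candidates = [
    (("a1",), 0.61),
    (("a1", "a270"), 0.68),          # best candidate
    (("a1", "a270", "a271"), 0.64),  # evaluated last, but worse
]

store = {}  # hypothetical object store, overwritten on every evaluation
best_subset = None
best_score = float("-inf")

for subset, score in candidates:
    # Storing inside the loop keeps only the most recent model.
    store["Model_new"] = {"trained_on": subset}
    if score > best_score:
        best_subset, best_score = subset, score

remembered = store["Model_new"]["trained_on"]
mismatch = remembered != best_subset

print("attributes selected by the best weights:", best_subset)
print("attributes the remembered model was trained on:", remembered)
print("mismatch:", mismatch)
```

If the two attribute sets differ, the stored model is applied to data it was not trained on, which could explain a large drop in accuracy. Training a fresh model on the data reduced to the selected attributes avoids depending on whatever was stored last.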
Marius
« Reply #8 on: February 10, 2012, 10:54:22 AM »
Hi Daniel, I had another look at the process, and the way it was set up before does not make sense. The Forward Selection executes its subprocess for many candidate feature sets, and there is no guarantee that the last execution takes place on the best feature set, so the last stored model is not necessarily the best one. You have to output the weights, apply them to the training data and then create the final model. Then you can apply it to the weighted test data. By the way, you should replace your Split Validation with an X-Validation for more reliable results, even though it will take more time to run. Best, Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.2.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.2.001" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. 
Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. <table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="539" width="768"> <operator activated="true" class="read_csv" compatibility="5.2.001" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="multiply" compatibility="5.2.001" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/> <operator activated="true" class="optimize_selection" compatibility="5.2.001" expanded="true" 
height="94" name="FS" width="90" x="380" y="30"> <parameter key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="keep_best" value="3"/> <parameter key="normalize_weights" value="false"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="521" width="433"> <operator activated="true" class="split_validation" compatibility="5.2.001" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes" width="90" x="148" y="30"/> <connect from_port="training" to_op="Naive Bayes" to_port="training set"/> <connect from_op="Naive Bayes" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="346"> <operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.2.001" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Applier" 
to_port="model"/> <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <operator activated="true" class="log" compatibility="5.2.001" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30"> <list key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.FS.value.performance"/> </list> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="select_by_weights" compatibility="5.2.001" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/> <operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes (2)" width="90" x="380" y="210"/> <operator activated="true" class="read_csv" compatibility="5.2.001" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> 
<parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="select_by_weights" compatibility="5.2.001" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/> <operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.2.001" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/> <connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="Naive Bayes (2)" to_port="training set"/> <connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/> <connect from_op="Naive Bayes (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="Read Test Data" from_port="output" 
to_op="AttributeWeightSelection (2)" to_port="example set input"/> <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/> <connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
Danyo83
« Reply #9 on: February 13, 2012, 04:24:54 PM »
Hi Marius, thanks, that really makes sense. It works for the Naive Bayes classifier, but when I switch to Linear Discriminant Analysis the error still occurs. The accuracy on the test set in the Forward Selection process is 71.15%, while the accuracy of the applied model on the identical dataset is 46.15% (all examples are classified into the same class). The selected attributes are the same, so that is not the underlying error. I really have no idea how this can happen. Here is the process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.1.017"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. 
Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. <table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="539" width="768"> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="2" value="a2.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="multiply" compatibility="5.1.017" expanded="true" height="94" name="Multiply" width="90" 
x="179" y="30"/> <operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="380" y="30"> <parameter key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="keep_best" value="3"/> <parameter key="maximum_number_of_generations" value="80"/> <parameter key="normalize_weights" value="false"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="521" width="433"> <operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/> <connect from_port="training" to_op="LDA" to_port="training set"/> <connect from_op="LDA" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="346"> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter 
key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Applier" to_port="model"/> <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30"> <list key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.FS.value.performance"/> </list> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345"> <parameter key="csv_file" 
value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="2" value="a2.true.real.attribute"/> <parameter key="3" value="a3.true.real.attribute"/> <parameter key="267" value="a267.true.real.attribute"/> <parameter key="268" value="a268.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/> 
<connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/> <connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/> <connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/> <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/> <connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
Marius
« Reply #10 on: February 13, 2012, 04:59:53 PM »
I just saw that your Validation is set to "linear sampling". That means it uses the first X examples for training and the remaining ones for testing. If your data is sorted in some way, the training and test sets will have different class distributions. You should switch the sampling type to stratified sampling, which keeps the ratio of positive to negative examples the same in the training and test sets. The outer model training does not suffer from that problem, since it uses unsampled data. But even after that fix the performances won't be exactly the same, because the outer Train/Apply combination uses the whole dataset for both training and testing, whereas the FS uses only one part of the data for training and the other part for testing. Btw, in the Log operator you should log the "performance" of the Validation, not of the FS. And I still strongly suggest replacing the Split Validation with an X-Validation. Best, Marius
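For readers following along, the two changes suggested in this reply might look roughly like the fragment below. It is only a sketch based on the process posted earlier in the thread: the operator names "Validation", "ProcessLog" and "FS" come from that process, while the value string "stratified sampling" and the log entry "operator.Validation.value.performance" are assumptions modeled on the existing parameters and should be checked in the operator's parameter panel.

```xml
<!-- Sketch only: inside the "Validation" (split_validation) operator,
     replace the linear sampling setting. The value string is an assumption. -->
<parameter key="sampling_type" value="stratified sampling"/>

<!-- Sketch only: inside the "ProcessLog" (log) operator, log the performance
     of the Validation instead of the FS. The second entry is an assumption
     modeled on the existing "operator.FS.value.performance" entry. -->
<list key="log">
  <parameter key="generation" value="operator.FS.value.generation"/>
  <parameter key="performance" value="operator.Validation.value.performance"/>
</list>
```

Both settings can also be changed in the GUI, which avoids typing value strings by hand.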
Danyo83
« Reply #11 on: February 13, 2012, 05:22:45 PM »
The CSV files aren't the same. The first CSV file, which is used for the FS, contains both the training set and the test set. I need linear sampling because the data has to stay in order: the first 2544 points must be the training set and the following 260 points must be the test set, so they mustn't be mixed. Since I cannot directly access the prediction label of the test set via the FS process, I built the model applier in order to get at the prediction label. The second CSV file therefore contains only the test set, i.e. the 260 data points. That is why I think the classification performance on the test set in the FS process should be at least nearly the same as the performance of the second process.
Marius
« Reply #12 on: February 13, 2012, 06:14:30 PM »
In that case you need a Filter Example Range operator in front of the outer LDA to select only the first 2544 examples.
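As a rough sketch of that suggestion, the fragment below restricts the outer training data to the first 2544 examples, so that the outer LDA sees the same training rows as the LDA inside the Validation. The operator class, parameter keys and port names are the ones that appear in the process posted later in this thread; the layout attributes (position, size) are omitted here on the assumption that RapidMiner fills them in when the operator is added in the GUI.

```xml
<!-- Sketch: keep only the training part (rows 1 to 2544) for the outer model. -->
<operator activated="true" class="filter_example_range" compatibility="5.1.017" expanded="true" name="Filter Example Range">
  <parameter key="first_example" value="1"/>
  <parameter key="last_example" value="2544"/>
</operator>

<!-- Wiring: Multiply (output 2) -> Filter Example Range -> AttributeWeightSelection (3) -> LDA (2) -->
<connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/>
```

The range 1 to 2544 matches the training_set_size of the Split Validation; if that size changes, the filter has to be changed with it.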
Danyo83
« Reply #13 on: February 14, 2012, 02:00:10 PM »
Hi Marius, thanks a lot. Now it works, and the two classification accuracies are identical. I have added the Filter Example Range operator you mentioned and moved the process log operator into the Validation process. Is this process correct?
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.1.017"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. 
Right click to open the context menu and repace the operator by another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. <table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="539" width="768"> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="multiply" compatibility="5.1.017" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/> <operator activated="true" class="optimize_selection" compatibility="5.1.017" expanded="true" height="94" name="FS" width="90" x="380" y="30"> <parameter key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="maximum_number_of_generations" value="80"/> <parameter key="normalize_weights" value="false"/> <parameter 
key="use_local_random_seed" value="true"/> <process expanded="true" height="521" width="433"> <operator activated="true" class="split_validation" compatibility="5.1.017" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/> <connect from_port="training" to_op="LDA" to_port="training set"/> <connect from_op="LDA" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="480"> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <operator activated="true" class="log" compatibility="5.1.017" expanded="true" height="76" name="ProcessLog" width="90" x="313" y="30"> <list key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.FS.value.performance"/> </list> </operator> <connect 
from_port="model" to_op="Applier" to_port="model"/> <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="averagable 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="2" value="a2.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="filter_example_range" compatibility="5.1.017" expanded="true" height="76" 
name="Filter Example Range" width="90" x="45" y="210"> <parameter key="first_example" value="1"/> <parameter key="last_example" value="2544"/> </operator> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.1.017" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/> <operator activated="true" class="select_by_weights" compatibility="5.1.017" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/> <operator activated="true" class="apply_model" compatibility="5.1.017" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.1.017" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/> <connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/> <connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect from_op="AttributeWeightSelection (3)" 
from_port="example set output" to_op="LDA (2)" to_port="training set"/> <connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/> <connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/> <connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
Marius
« Reply #14 on: February 14, 2012, 02:31:30 PM »
Yes, it seems to be fine. You only misunderstood me about the Log operator: it should stay where it was, directly inside the FS, but be configured to log the performance of the Validation, as below. Best, Marius
<?xml version="1.0" encoding="UTF-8" standalone="no"?> <process version="5.2.000"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Root"> <description><p> Transformations of the attribute space may ease learning in a way, that simple learning schemes may be able to learn complex functions. This is the basic idea of the kernel trick. But even without kernel based learning schemes the transformation of feature space may be necessary to reach good learning results. </p> <p> RapidMiner offers several different feature selection, construction, and extraction methods. This selection process (the well known forward selection) uses an inner cross validation for performance estimation. This building block serves as fitness evaluation for all candidate feature sets. Since the performance of a certain learning scheme is taken into account we refer to processes of this type as &quot;wrapper approaches&quot;.</p> <p>Additionally the process log operator plots intermediate results. You can inspect them online in the Results tab. Please refer to the visualization sample processes or the RapidMiner tutorial for further details.</p> <p> Try the following: <ul> <li>Start the process and change to &quot;Result&quot; view. There can be a plot selected. Plot the &quot;performance&quot; against the &quot;generation&quot; of the feature selection operator.</li> <li>Select the feature selection operator in the tree view. Change the search directory from forward (forward selection) to backward (backward elimination). Restart the process. All features will be selected.</li> <li>Select the feature selection operator. 
Right click to open the context menu and replace the operator with another feature selection scheme (for example a genetic algorithm).</li> <li>Have a look at the list of the process log operator. Every time it is applied it collects the specified data. Please refer to the RapidMiner Tutorial for further explanations. After changing the feature selection operator to the genetic algorithm approach, you have to specify the correct values. <table><tr><td><icon>groups/24/visualization</icon></td><td><i>Use the process log operator to log values online.</i></td></tr></table> </li> </ul> </p></description> <process expanded="true" height="539" width="768"> <operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="multiply" compatibility="5.2.000" expanded="true" height="94" name="Multiply" width="90" x="179" y="30"/> <operator activated="true" class="optimize_selection" compatibility="5.2.000" expanded="true" height="94" name="FS" width="90" x="380" y="30"> <parameter key="generations_without_improval" value="40"/> <parameter key="limit_number_of_generations" value="true"/> <parameter key="maximum_number_of_generations" value="80"/> <parameter key="normalize_weights" value="false"/> <parameter
key="use_local_random_seed" value="true"/> <process expanded="true" height="521" width="681"> <operator activated="true" class="split_validation" compatibility="5.2.000" expanded="true" height="112" name="Validation" width="90" x="112" y="30"> <parameter key="split" value="absolute"/> <parameter key="split_ratio" value="0.95"/> <parameter key="training_set_size" value="2544"/> <parameter key="test_set_size" value="260"/> <parameter key="sampling_type" value="linear sampling"/> <parameter key="use_local_random_seed" value="true"/> <process expanded="true" height="191" width="331"> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.2.000" expanded="true" height="76" name="LDA" width="90" x="136" y="30"/> <connect from_port="training" to_op="LDA" to_port="training set"/> <connect from_op="LDA" from_port="model" to_port="model"/> <portSpacing port="source_training" spacing="0"/> <portSpacing port="sink_model" spacing="0"/> <portSpacing port="sink_through 1" spacing="0"/> </process> <process expanded="true" height="296" width="480"> <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Applier" width="90" x="45" y="30"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_Validation" width="90" x="179" y="30"> <parameter key="classification_error" value="true"/> <parameter key="weighted_mean_recall" value="true"/> <parameter key="absolute_error" value="true"/> <parameter key="correlation" value="true"/> <list key="class_weights"/> </operator> <connect from_port="model" to_op="Applier" to_port="model"/> <connect from_port="test set" to_op="Applier" to_port="unlabelled data"/> <connect from_op="Applier" from_port="labelled data" to_op="Performance_Validation" to_port="labelled data"/> <connect from_op="Performance_Validation" from_port="performance" 
to_port="averagable 1"/> <portSpacing port="source_model" spacing="0"/> <portSpacing port="source_test set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_averagable 1" spacing="0"/> <portSpacing port="sink_averagable 2" spacing="0"/> </process> </operator> <operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="ProcessLog" width="90" x="447" y="30"> <list key="log"> <parameter key="generation" value="operator.FS.value.generation"/> <parameter key="performance" value="operator.Validation.value.performance"/> </list> </operator> <connect from_port="example set" to_op="Validation" to_port="training"/> <connect from_op="Validation" from_port="averagable 1" to_op="ProcessLog" to_port="through 1"/> <connect from_op="ProcessLog" from_port="through 1" to_port="performance"/> <portSpacing port="source_example set" spacing="0"/> <portSpacing port="source_through 1" spacing="0"/> <portSpacing port="sink_performance" spacing="0"/> </process> </operator> <operator activated="true" class="read_csv" compatibility="5.2.000" expanded="true" height="60" name="Read Test Data" width="90" x="45" y="345"> <parameter key="csv_file" value="D:\Promotion\Matlab\Ich\Workspaces\Tag\Zeit\Feature_Set_final_test.dat"/> <parameter key="column_separators" value=","/> <parameter key="first_row_as_names" value="false"/> <list key="annotations"> <parameter key="0" value="Name"/> </list> <parameter key="encoding" value="windows-1252"/> <list key="data_set_meta_data_information"> <parameter key="0" value="Label.true.binominal.label"/> <parameter key="1" value="a1.true.real.attribute"/> <parameter key="2" value="a2.true.real.attribute"/> <parameter key="269" value="a269.true.real.attribute"/> <parameter key="270" value="a270.true.integer.attribute"/> <parameter key="271" value="a271.true.integer.attribute"/> </list> </operator> <operator activated="true" class="filter_example_range" compatibility="5.2.000" 
expanded="true" height="76" name="Filter Example Range" width="90" x="45" y="210"> <parameter key="first_example" value="1"/> <parameter key="last_example" value="2544"/> </operator> <operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (3)" width="90" x="179" y="210"/> <operator activated="true" class="linear_discriminant_analysis" compatibility="5.2.000" expanded="true" height="76" name="LDA (2)" width="90" x="346" y="210"/> <operator activated="true" class="select_by_weights" compatibility="5.2.000" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="380" y="345"/> <operator activated="true" class="apply_model" compatibility="5.2.000" expanded="true" height="76" name="Apply Model (2)" width="90" x="514" y="210"> <list key="application_parameters"/> </operator> <operator activated="true" class="performance_classification" compatibility="5.2.000" expanded="true" height="76" name="Performance_ungesehen" width="90" x="648" y="210"> <parameter key="classification_error" value="true"/> <parameter key="absolute_error" value="true"/> <list key="class_weights"/> </operator> <connect from_op="Read CSV" from_port="output" to_op="Multiply" to_port="input"/> <connect from_op="Multiply" from_port="output 1" to_op="FS" to_port="example set in"/> <connect from_op="Multiply" from_port="output 2" to_op="Filter Example Range" to_port="example set input"/> <connect from_op="FS" from_port="example set out" to_port="result 2"/> <connect from_op="FS" from_port="weights" to_op="AttributeWeightSelection (3)" to_port="weights"/> <connect from_op="FS" from_port="performance" to_port="result 1"/> <connect from_op="Read Test Data" from_port="output" to_op="AttributeWeightSelection (2)" to_port="example set input"/> <connect from_op="Filter Example Range" from_port="example set output" to_op="AttributeWeightSelection (3)" to_port="example set input"/> <connect 
from_op="AttributeWeightSelection (3)" from_port="example set output" to_op="LDA (2)" to_port="training set"/> <connect from_op="AttributeWeightSelection (3)" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/> <connect from_op="LDA (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/> <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_op="Apply Model (2)" to_port="unlabelled data"/> <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance_ungesehen" to_port="labelled data"/> <connect from_op="Performance_ungesehen" from_port="performance" to_port="result 3"/> <connect from_op="Performance_ungesehen" from_port="example set" to_port="result 4"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> <portSpacing port="sink_result 5" spacing="0"/> </process> </operator> </process>
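The point made in this reply comes down to the `log` list of the Log operator, which sits inside FS. The fragment below is excerpted from the process above as a reading aid. Only the XML comments are added, and they are an interpretation of the configuration, not part of the original process.

```xml
<operator activated="true" class="log" compatibility="5.2.000" expanded="true" height="76" name="ProcessLog" width="90" x="447" y="30">
  <list key="log">
    <!-- The generation counter is read from the feature selection operator named "FS". -->
    <parameter key="generation" value="operator.FS.value.generation"/>
    <!-- The performance value is read from the inner operator named "Validation", not from FS. -->
    <parameter key="performance" value="operator.Validation.value.performance"/>
  </list>
</operator>
```

The values follow the pattern `operator.<operator name>.value.<value name>`, so the names `FS` and `Validation` must match the operator names used in the process.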