Author Topic: Nearest neighbours always gives the same prediction ? !!  (Read 1233 times)
traveria
« on: March 28, 2009, 12:46:35 AM »

Hello, I am getting a surprising result:

I am running a very simple test with nearest neighbors (see the XML code below), using a training dataset and a test dataset (see the short datasets below).

The strange result is that I always get the same predicted value, regardless of which test example I use.

If I use the "ExampleSetGenerator" instead of reading a dataset from a file (activate it in the process I include below), I get a different prediction for every new test example, as expected.

Can anyone explain why I always get the same prediction when I read the data from a file?

Any hint or solution would be welcome!

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma.aml"/>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples"   value="10000"/>
        <parameter key="target_function"   value="polynomial"/>
    </operator>
    <operator name="NearestNeighbors" class="NearestNeighbors">
    </operator>
    <operator name="ExampleSource (4)" class="ExampleSource">
        <parameter key="attributes"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma_test.aml"/>
        <parameter key="permutate"   value="true"/>
    </operator>
    <operator name="ExampleSetGenerator (2)" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples"   value="10"/>
        <parameter key="target_function"   value="polynomial"/>
    </operator>
    <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
        <parameter key="first_example"   value="2"/>
        <parameter key="last_example"   value="2"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
        <parameter key="create_view"   value="true"/>
        <parameter key="keep_model"   value="true"/>
    </operator>
</operator>

TRAINING DATA

1   1.6   0.84   0   0.76
2   2.17   0.91   0.3   0.96
3   1.61   0.14   0.48   1
4   0.84   -0.76   0.6   1
5   0.74   -0.96   0.7   1
6   1.5   -0.28   0.78   1
7   2.5   0.66   0.85   1
8   2.89   0.99   0.9   1
9   2.37   0.41   0.95   1
10   1.46   -0.54   1   1
11   1.04   -1   1.04   1
12   1.54   -0.54   1.08   1
13   2.53   0.42   1.11   1
14   3.14   0.99   1.15   1
15   2.83   0.65   1.18   1
16   1.92   -0.29   1.2   1
17   1.27   -0.96   1.23   1
18   1.5   -0.75   1.26   1
19   2.43   0.15   1.28   1
20   3.21   0.91   1.3   1

TEST DATA

21   3.16   0.84   1.32   1
22   2.33   -0.01   1.34   1
23   1.52   -0.85   1.36   1
24   1.47   -0.91   1.38   1
25   2.27   -0.13   1.4   1
26   3.18   0.76   1.41   1
27   3.39   0.96   1.43   1
28   2.72   0.27   1.45   1
29   1.8   -0.66   1.46   1
30   1.49   -0.99   1.48   1
haddock
« Reply #1 on: March 28, 2009, 08:36:54 PM »

Hi,

I'm not sure the result is as surprising as you think. I can replicate your problem on your own data if I simply include the left-hand column as a normal attribute, even though it looks much more like an Id attribute. If you treat it as one, your "surprising" result disappears. So I think you should check your AML file to see how you've been handling that column.

Here's some code to illustrate the point. If you leave "1" as the value for "select_which" in the very first operator, all the predictions are the same; if you insert "2" instead, they are not. That is because the second example source marks column one as an Id column, whereas the first does not.

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="OperatorSelector" class="OperatorSelector" expanded="yes">
        <operator name="SimpleExampleSource" class="SimpleExampleSource">
            <parameter key="filename" value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
            <parameter key="label_column" value="2"/>
        </operator>
        <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
            <parameter key="filename" value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
            <parameter key="label_column" value="2"/>
            <parameter key="id_column" value="1"/>
        </operator>
    </operator>
    <operator name="IOMultiplier" class="IOMultiplier">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
        <parameter key="first_example" value="1"/>
        <parameter key="last_example" value="20"/>
    </operator>
    <operator name="NearestNeighbors" class="NearestNeighbors">
    </operator>
    <operator name="IOSelector" class="IOSelector">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="ExampleRangeFilter (2)" class="ExampleRangeFilter">
        <parameter key="first_example" value="21"/>
        <parameter key="last_example" value="30"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <parameter key="keep_model" value="true"/>
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
    </operator>
</operator>

I've attached the datafile with all 30 examples; you'll need to adjust the path to it in order to run the demo.
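For anyone without RapidMiner to hand, the same effect can be sketched in plain Python. This is only an illustration, not what the NearestNeighbors operator does internally: it assumes k = 1, plain Euclidean distance with no normalization, and column 2 as the label. The helper name `nearest_label` is made up for the example. The rows are copied from the training and test data in the first post.

```python
import math

# Columns: id, label, a1, a2, a3 (rows copied from the first post).
TRAIN = [
    (1, 1.6, 0.84, 0, 0.76), (2, 2.17, 0.91, 0.3, 0.96),
    (3, 1.61, 0.14, 0.48, 1), (4, 0.84, -0.76, 0.6, 1),
    (5, 0.74, -0.96, 0.7, 1), (6, 1.5, -0.28, 0.78, 1),
    (7, 2.5, 0.66, 0.85, 1), (8, 2.89, 0.99, 0.9, 1),
    (9, 2.37, 0.41, 0.95, 1), (10, 1.46, -0.54, 1, 1),
    (11, 1.04, -1, 1.04, 1), (12, 1.54, -0.54, 1.08, 1),
    (13, 2.53, 0.42, 1.11, 1), (14, 3.14, 0.99, 1.15, 1),
    (15, 2.83, 0.65, 1.18, 1), (16, 1.92, -0.29, 1.2, 1),
    (17, 1.27, -0.96, 1.23, 1), (18, 1.5, -0.75, 1.26, 1),
    (19, 2.43, 0.15, 1.28, 1), (20, 3.21, 0.91, 1.3, 1),
]
TEST = [
    (21, 3.16, 0.84, 1.32, 1), (22, 2.33, -0.01, 1.34, 1),
    (23, 1.52, -0.85, 1.36, 1), (24, 1.47, -0.91, 1.38, 1),
    (25, 2.27, -0.13, 1.4, 1), (26, 3.18, 0.76, 1.41, 1),
    (27, 3.39, 0.96, 1.43, 1), (28, 2.72, 0.27, 1.45, 1),
    (29, 1.8, -0.66, 1.46, 1), (30, 1.49, -0.99, 1.48, 1),
]


def nearest_label(train, query, use_id):
    """Return the label of the closest training row (1-NN, Euclidean).

    Assumption: no normalization, so the id column dominates when included.
    """
    def features(row):
        feats = list(row[2:])      # a1, a2, a3
        if use_id:
            feats.append(row[0])   # treat the id as a regular attribute
        return feats

    q = features(query)
    best = min(train, key=lambda r: math.dist(features(r), q))
    return best[1]                 # column 2 is the label


with_id = [nearest_label(TRAIN, t, use_id=True) for t in TEST]
without_id = [nearest_label(TRAIN, t, use_id=False) for t in TEST]

print("id as regular attribute:", with_id)
print("id treated as Id:       ", without_id)
```

With the id included, every test row (ids 21 to 30) is closest to training row 20, because the id differences swamp the other columns, so every prediction is that row's label, 3.21. With the id left out, the predictions vary from row to row.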

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
traveria
« Reply #2 on: March 30, 2009, 05:16:19 PM »

Many thanks Haddock,

After some investigation I realized that the algorithm produces the same prediction in every case because the training dataset does not have the same data description (the metadata in the AML file) as the test dataset. As a result, the algorithm does not know what to predict and keeps returning the last correct prediction.

I still do not understand why the two datasets do not have the same structure. Try the minimal process at the end of this message to see for yourself: what it writes the first time is not the same as what it writes afterwards.

After fixing this small problem, Nearest Neighbors runs correctly.
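A mismatch like this can be caught before applying the model by comparing the two attribute descriptions. The sketch below is plain Python, not RapidMiner. The column names, the roles, and the helper `schema_mismatches` are all hypothetical, and it assumes each description can be reduced to an ordered list of (name, role) pairs.

```python
from itertools import zip_longest


def schema_mismatches(train_schema, test_schema):
    """Return readable differences between two ordered (name, role) lists."""
    problems = []
    pairs = zip_longest(train_schema, test_schema)
    for pos, (train_col, test_col) in enumerate(pairs, start=1):
        if train_col != test_col:
            problems.append(
                f"column {pos}: training has {train_col}, test has {test_col}"
            )
    return problems


# Hypothetical descriptions: the test description treats the first column
# as a regular attribute, while the training description marks it as an id.
train_schema = [("id", "id"), ("sum", "label"),
                ("a1", "attribute"), ("a2", "attribute"), ("a3", "attribute")]
test_schema = [("id", "attribute"), ("sum", "label"),
               ("a1", "attribute"), ("a2", "attribute"), ("a3", "attribute")]

for line in schema_mismatches(train_schema, test_schema):
    print(line)
```

An empty result means the two descriptions agree column by column; anything else is worth fixing before the model is applied.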

Many thanks for your comments anyway!

Miquel

<?xml version="1.0" encoding="UTF-8"?>
<process version="4.2">

  <operator name="Root" class="Process" expanded="yes">
      <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
          <parameter key="number_examples"   value="1000"/>
          <parameter key="target_function"   value="polynomial"/>
      </operator>
      <operator name="ExampleSetWriter (2)" class="ExampleSetWriter">
          <parameter key="attribute_description_file"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.aml"/>
          <parameter key="example_set_file"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
          <parameter key="quote_whitespace"   value="false"/>
      </operator>
      <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
          <parameter key="filename"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
          <parameter key="label_column"   value="6"/>
          <parameter key="use_quotes"   value="true"/>
      </operator>
      <operator name="FeatureRangeRemoval" class="FeatureRangeRemoval">
          <parameter key="first_attribute"   value="6"/>
          <parameter key="last_attribute"   value="6"/>
      </operator>
      <operator name="ExampleSetWriter" class="ExampleSetWriter">
          <parameter key="attribute_description_file"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.aml"/>
          <parameter key="example_set_file"   value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.dat"/>
          <parameter key="quote_whitespace"   value="false"/>
      </operator>
  </operator>

</process>
haddock
« Reply #3 on: March 30, 2009, 07:57:36 PM »

Hi,

It is rather difficult to comment on this unless you show what you put in "polinomi_set.aml". Perhaps you will oblige us?

However, some things are obvious, whatever you put in that file:

1. The generator produces 5 attributes and 1 label = 6 columns.

2. Removing attribute number 6 cannot work unless there are 6 attributes.

3. There can only be 6 attributes if the label column is set to 0.

4. But in your code it is marked as being in column 6!

5. So this code could NEVER work, whatever is in "polinomi_set.aml".
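The counting in points 1 to 5 can be written out as a small Python sketch. It is only a model of the argument, not RapidMiner code: the function `regular_attributes` is hypothetical, and it assumes that the label column is not counted among the regular attributes, that a label column of 0 means "no label", and that attribute positions are 1-based.

```python
def regular_attributes(total_columns, label_column):
    """Count the regular attributes once the label column is set aside.

    Assumption: label_column == 0 means no column is used as the label.
    """
    if label_column == 0:
        return total_columns
    if not 1 <= label_column <= total_columns:
        raise ValueError("label column lies outside the file")
    return total_columns - 1


TOTAL_COLUMNS = 6        # 5 generated attributes + 1 label
ATTRIBUTE_TO_REMOVE = 6  # first_attribute = last_attribute = 6

for label_column in (6, 0):
    count = regular_attributes(TOTAL_COLUMNS, label_column)
    possible = ATTRIBUTE_TO_REMOVE <= count
    print(f"label_column={label_column}: {count} regular attributes, "
          f"removing attribute {ATTRIBUTE_TO_REMOVE} possible: {possible}")
```

With the label in column 6 only 5 regular attributes remain, so there is no attribute number 6 to remove; with the label column set to 0 there are 6.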

Which leaves me with a question: what on earth were you trying to achieve with this post?
« Last Edit: March 30, 2009, 08:36:17 PM by haddock »
