Author Topic: Insufficient results with M5P regression tree  (Read 3412 times)
michaelhecht
Jr. Member
**
Posts: 90


« on: May 21, 2009, 07:11:02 PM »

Hello,

if anyone is interested please try the following:

produce a file containing two columns

x = 0, 0.1, 0.2, ..., 12.6;
y = sin(x)

Then apply M5P (with or without normalization).

The result is quite disappointing. Does anyone know how to get an acceptable result?
I expected something like a piecewise linear approximation of the sine function,
but got something far from it.

Thank You.
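
For anyone who wants to reproduce this, here is a minimal sketch that writes the two-column file described above. The file name `xy.csv` and the column headers `x` and `y` are assumptions chosen for illustration; adjust them to whatever your import operator expects.

```python
import csv
import math


def make_sine_rows(n_points=127, step=0.1):
    """Return (x, sin(x)) pairs for x = 0, 0.1, ..., 12.6 (127 points)."""
    rows = []
    for i in range(n_points):
        # Build x from the index instead of adding 0.1 repeatedly,
        # so floating-point error does not accumulate.
        x = round(i * step, 1)
        rows.append((x, math.sin(x)))
    return rows


def write_csv(path, rows):
    """Write the rows to a CSV file with a header line 'x,y'."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])
        writer.writerows(rows)


if __name__ == "__main__":
    # "xy.csv" is a hypothetical file name.
    write_csv("xy.csv", make_sine_rows())
```

The data covers roughly two full periods of the sine function, since 12.6 is just above 4*pi.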
keith
Full Member
***
Posts: 160


« Reply #1 on: May 21, 2009, 08:46:22 PM »

Quote
Hello,

if anyone is interested please try the following:


It would actually be more helpful (and more likely to generate a response), if you include the XML for the process you are running in your forum post.

Quote
produce a file containing two columns

x = 0, 0.1, 0.2, ..., 12.6;
y = sin(x)

Then apply M5P (with or without normalization).

The result is quite disappointing. Does anyone know how to get an acceptable result?
I expected something like a piecewise linear approximation of the sine function,
but got something far from it.

There's a fair amount of ambiguity in your question.  What constitutes an acceptable result?  Is there a reason you believe M5P is a good learner in this situation?  Did you experiment with any of the options for the M5P learner?

I just tried the following process; the only changes from the default settings are the checked boxes for parameters N (unpruned), U (unsmoothed), and R (regression tree instead of model tree):

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="CSVExampleSource" class="CSVExampleSource">
        <parameter key="filename" value="c:\temp\xy.csv"/>
        <parameter key="label_name" value="y"/>
    </operator>
    <operator name="W-M5P" class="W-M5P">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="N" value="true"/>
        <parameter key="U" value="true"/>
        <parameter key="R" value="true"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>

And the plot of x vs. prediction(y) looks, to my eyes, much more sin-like.  But I don't know if using an unpruned, unsmoothed learner makes sense for your problem.

Keith
michaelhecht
Jr. Member
**
Posts: 90


« Reply #2 on: May 21, 2009, 09:08:45 PM »

Hi,

sorry, here is the XML

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="C:\Programme\Rapid-I\RapidMiner-4.4\sinus"/>
    </operator>
    <operator name="Normalization" class="Normalization">
    </operator>
    <operator name="W-M5P" class="W-M5P">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="U" value="true"/>
        <parameter key="M" value="10.0"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
    </operator>
</operator>

What I get is a piecewise constant result, i.e. the leaves of the tree are y = const.
Only the last leaf gives a linear model: y = 3.2196 * x - 4.5545

If I had such a truly linear model at every leaf of the tree, it would be OK, i.e. what
I would expect.
No setting improves this, even though the tree could produce y = a*x+b
in each leaf, which should give a better prediction. So why doesn't M5P behave like
this?
If I select the smoothed tree, the results are even worse.

I hope this makes my "problem" clearer.

P.S.:
If you google for "stepwise regression tree HUANG" or go directly to
http://www.landcover.org/pdf/ijrs24_p75.pdf
and look at page 77 (i.e. page 3 of the 16-page document), you will see what I
mean. If this SRT algorithm became part of RapidMiner I would
appreciate it ;) , although I still don't understand why M5P doesn't behave comparably.
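
The expectation described in this post (a linear model y = a*x+b in every leaf instead of y = const) can be illustrated with a small sketch. It is not M5P: it simply cuts the data into consecutive chunks of a fixed size (`leaf_size`, a hypothetical stand-in for leaves) and fits either the mean or an ordinary least-squares line in each chunk, then compares the training error of the two variants.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x


def rmse(ys, preds):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


def piecewise_predictions(xs, ys, leaf_size, linear):
    """Cut the data into consecutive chunks of leaf_size points and fit
    either a constant (the chunk mean) or a straight line in each chunk."""
    preds = []
    for start in range(0, len(xs), leaf_size):
        cx = xs[start:start + leaf_size]
        cy = ys[start:start + leaf_size]
        if linear and len(cx) > 1:
            a, b = fit_line(cx, cy)
        else:
            a, b = 0.0, sum(cy) / len(cy)
        preds.extend(a * x + b for x in cx)
    return preds


# Same data as in the original post: x = 0, 0.1, ..., 12.6 and y = sin(x).
xs = [round(i * 0.1, 1) for i in range(127)]
ys = [math.sin(x) for x in xs]

const_rmse = rmse(ys, piecewise_predictions(xs, ys, 10, linear=False))
linear_rmse = rmse(ys, piecewise_predictions(xs, ys, 10, linear=True))
print(f"constant leaves: RMSE = {const_rmse:.4f}")
print(f"linear leaves:   RMSE = {linear_rmse:.4f}")
```

On the training data a least-squares line can never fit a chunk worse than the chunk mean, so the linear variant always has the lower (or equal) error here. This compares training error only and says nothing about how M5P itself decides which leaf models to keep.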
keith
Full Member
***
Posts: 160


« Reply #3 on: May 21, 2009, 09:52:16 PM »

I think you're getting into trouble because of the value of M (minimum number of instances per leaf), and the cyclical nature of the data.  If I take your process flow and change M from 10 to 5, I get linear models for nodes 1, 5, 6, 7, 10, 11, 14, 15, 16, and 20, and constant values elsewhere.
michaelhecht
Jr. Member
**
Posts: 90


« Reply #4 on: May 22, 2009, 07:40:59 AM »

OK, I see the difference.

Nevertheless, I cannot understand why the fraction of constant leaves, i.e. y = const, increases if I change M from 5 to 6.
I get 10 more constant leaves at positions where y = a*x+b would be better. Isn't a constant regression
in the leaves worse than a non-constant one?

It's clear to me that the algorithm comes from Weka and not RapidMiner, so you cannot know in detail what happens.
Still, I would like to understand why increasing M increases the number of constant leaves even
though it worsens the result.

By the way, if you are an expert ;) , would it be possible to post a workflow for optimizing the parameters automatically?
Up to now I haven't got the right feeling for applying meta methods like grid search or cross-validation properly.

Thanks in advance. (I need at least an answer to my question; the workflow would be nice.)

keith
Full Member
***
Posts: 160


« Reply #5 on: May 22, 2009, 02:28:37 PM »

I'm far from an expert with RM.  :-)  I've never used M5P before now, so what little I know comes from a bit of experimentation and Googling yesterday.  The paper describing the method seems to be available at http://www.cs.waikato.ac.nz/pubs/wp/1996/uow-cs-wp-1996-23.pdf, and it may answer that question.  My guess is that there's some kind of rule that rounds the slope of the regression model to zero if it is too close to zero.  Maybe it's an interaction between the number of observations in each node and the regressed slope.  Beyond that, you'd probably have more luck getting an answer from a Weka list or forum.

As for the parameter optimization, take a look at 07_Meta/01_ParameterOptimization.xml in the RM samples directory.  The GridParameterOptimization node is where you'd specify what parameters you want to tinker with.
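
To show the idea behind a grid search combined with cross-validation outside of RapidMiner, here is a rough sketch. It does not use M5P or any RapidMiner operator: `train_piecewise` is a hypothetical stand-in learner (a line fitted to each chunk of `leaf_size` consecutive points), and `leaf_size` only loosely plays the role of M. All function names are made up for this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, mean_y - a * mean_x


def train_piecewise(xs, ys, leaf_size):
    """Stand-in learner (not M5P): sort by x, cut into chunks of leaf_size
    points, fit a line per chunk. Returns a list of (upper_x, a, b)."""
    pairs = sorted(zip(xs, ys))
    model = []
    for start in range(0, len(pairs), leaf_size):
        chunk = pairs[start:start + leaf_size]
        cx = [p[0] for p in chunk]
        cy = [p[1] for p in chunk]
        a, b = fit_line(cx, cy)
        model.append((cx[-1], a, b))
    return model


def predict(model, x):
    """Use the first chunk whose upper bound covers x, else the last chunk."""
    for upper, a, b in model:
        if x <= upper:
            return a * x + b
    _, a, b = model[-1]
    return a * x + b


def cross_val_rmse(xs, ys, leaf_size, folds=5):
    """k-fold cross-validation with interleaved folds (every folds-th point
    is held out); returns the RMSE over all held-out predictions."""
    sq_err = 0.0
    data = list(zip(xs, ys))
    for k in range(folds):
        train = [p for i, p in enumerate(data) if i % folds != k]
        test = [p for i, p in enumerate(data) if i % folds == k]
        model = train_piecewise([p[0] for p in train],
                                [p[1] for p in train], leaf_size)
        sq_err += sum((y - predict(model, x)) ** 2 for x, y in test)
    return math.sqrt(sq_err / len(data))


def grid_search(xs, ys, candidates):
    """Evaluate every candidate leaf size and return (best, all scores)."""
    scores = {m: cross_val_rmse(xs, ys, m) for m in candidates}
    best = min(scores, key=scores.get)
    return best, scores


xs = [round(i * 0.1, 1) for i in range(127)]
ys = [math.sin(x) for x in xs]
candidates = [2, 5, 10, 20, 60]
best, scores = grid_search(xs, ys, candidates)
for m in candidates:
    print(f"leaf_size = {m:2d}: cross-validated RMSE = {scores[m]:.4f}")
print("best leaf_size:", best)
```

The point is that the parameter is chosen by the error on held-out points rather than on the training data. In RapidMiner the same roles would presumably be played by the parameter optimization operator wrapping a cross-validation around the learner, as in the sample process mentioned above.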
michaelhecht
Jr. Member
**
Posts: 90


« Reply #6 on: May 22, 2009, 02:34:50 PM »

Thank you again, I will try to find an appropriate solution for me ;)

I originally tried M5P on this problem only to get an idea of how it works.
Now I really have doubts about applying this method to other data that I'm not familiar with.
