Topic: Prediction (Forecasting) with RM
Ingo Mierswa
Administrator
« on: May 23, 2008, 11:51:59 PM »

Original message from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2019003&forum_id=390413

Hi,
I have electricity data for the past year. Using this data, I want to predict the values for the next month. Is this possible with RM?
Can somebody help me in this regard?
Thanks in advance
 
Thanks,
Swapk


Answer by Ingo Mierswa:

Hello Swapk,
 
In principle, tasks like this are what RapidMiner was made for. You could apply a windowing to the past data and create one prediction model per prediction horizon on the resulting windows. Then, all models are applied to the last available window and their predictions are appended to the series as the forecast.
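
To make the "one model per horizon" idea concrete outside RapidMiner, here is a minimal Python sketch. It is only an illustration under assumptions: the function names are hypothetical, and the "drift" learner is a deliberately trivial stand-in for whatever regression learner you would really use.

```python
def window_series(values, window_size, horizon):
    """Build (history, label) rows; the label lies `horizon` steps after the window."""
    rows = []
    for start in range(len(values) - window_size - horizon + 1):
        history = values[start:start + window_size]
        rows.append((history, values[start + window_size + horizon - 1]))
    return rows


def fit_drift_model(rows):
    """Stand-in learner (assumption): average change from last window value to label."""
    mean_delta = sum(label - history[-1] for history, label in rows) / len(rows)
    return lambda history: history[-1] + mean_delta


def forecast(values, window_size, max_horizon):
    """Train one model per horizon and apply each to the last available window."""
    last_window = values[-window_size:]
    predictions = []
    for horizon in range(1, max_horizon + 1):
        model = fit_drift_model(window_series(values, window_size, horizon))
        predictions.append(model(last_window))
    return predictions


series = [float(2 * t) for t in range(20)]  # toy series 0, 2, 4, ..., 38
print(forecast(series, window_size=5, max_horizon=3))  # [40.0, 42.0, 44.0]
```

The predictions for horizons 1 to 3 are what would be appended to the end of the series.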
 
Cheers,
Ingo


Question by Gladys:

Hello Ingo:
 
How can we implement this windowing scheme in RapidMiner?
 
Best Regards,
 
Gladys


Answer by Ingo:

Hi Gladys,
 
The basic idea is to use a windowing operator, as in the following process:
 
Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
        <parameter key="number_of_attributes" value="1"/>
        <parameter key="target_function" value="sum"/>
    </operator>
    <operator name="FeatureNameFilter" class="FeatureNameFilter">
        <parameter key="filter_special_features" value="true"/>
        <parameter key="skip_features_with_name" value="label"/>
    </operator>
    <operator name="Series2WindowExamples" class="Series2WindowExamples">
        <parameter key="series_representation" value="encode_series_by_examples"/>
        <parameter key="window_size" value="10"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <operator name="LibSVMLearner" class="LibSVMLearner">
            <list key="class_weights">
            </list>
            <parameter key="svm_type" value="epsilon-SVR"/>
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="Performance" class="Performance">
            </operator>
        </operator>
    </operator>
</operator>
 

There is also an operator for multivariate windowing, and there are sliding window validations (backtesting), which are more appropriate for this type of analysis than the plain cross-validation used above. Just try to play around in this field!
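
The point of a sliding window validation is that the test rows always lie after the training rows in time. A rough Python sketch of how such splits could be generated is shown here; the function name and its parameters are assumptions for illustration, not the parameters of the RapidMiner operator.

```python
def sliding_window_splits(n_rows, train_size, test_size, step):
    """Yield (train_indices, test_indices); test rows always follow the train rows."""
    start = 0
    while start + train_size + test_size <= n_rows:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step


splits = list(sliding_window_splits(n_rows=10, train_size=4, test_size=2, step=2))
for train, test in splits:
    print(train, test)
# [0, 1, 2, 3] [4, 5]
# [2, 3, 4, 5] [6, 7]
# [4, 5, 6, 7] [8, 9]
```

Unlike a shuffled cross-validation, no split ever trains on data from the future of its test rows.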
 
Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
marcel
Guest
« Reply #1 on: May 24, 2008, 08:57:33 AM »

We are working on network traffic data and I guess we also need this windowing function to predict future usage. Could you please provide some further information about this function? Of course, we can copy the XML example into our experiment, but we would like to know a bit more about how it works. Are there any documents for further reading, or some examples? Any information would be a great help. Thanks!

Marcel.
Ingo Mierswa
« Reply #2 on: May 24, 2008, 10:18:42 AM »

Hi Marcel,

the basic idea is pretty simple. Let's say you have a series of values (univariate case, i.e. only one dimension):

v1
v2
v3
v4
v5
v6
...
v100


The task now is to learn from the past to predict a value sometime in the future. For solving this task, we will employ a windowing approach. The first question is: how long should the history we look at be? This is the width of the windows. Let's say we regard a history of 5 values for each prediction. The second question is: how far do we want to look into the future? For the sake of simplicity, let's say we just want to predict the next value (i.e. we use a prediction horizon of 1). A windowing will then transform the data set like this (using a step size of 1):

Code:
att1 att2 att3 att4 att5 label
--------------------------------
v1   v2   v3   v4   v5   v6
v2   v3   v4   v5   v6   v7
v3   v4   v5   v6   v7   v8
...
v95  v96  v97  v98  v99  v100

The first five attributes are the history which is taken into account as attributes / variables / features to learn from. The label is the value which should be predicted. It is simply the next value after the last value of the window (since we chose a horizon of 1).
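
For readers who prefer code to tables, the same transformation can be sketched in a few lines of Python. This is only an illustration of the idea, not how Series2WindowExamples is implemented; the function name and parameter names are assumptions.

```python
def window_series(values, window_size=5, horizon=1, step=1):
    """Turn a univariate series into (history, label) rows."""
    rows = []
    last_start = len(values) - window_size - horizon
    for start in range(0, last_start + 1, step):
        history = values[start:start + window_size]
        # The label lies `horizon` positions after the last value of the window.
        label = values[start + window_size + horizon - 1]
        rows.append((history, label))
    return rows


series = [f"v{i}" for i in range(1, 101)]  # v1 ... v100, oldest value first
rows = window_series(series)
print(rows[0])    # (['v1', 'v2', 'v3', 'v4', 'v5'], 'v6')
print(rows[-1])   # (['v95', 'v96', 'v97', 'v98', 'v99'], 'v100')
print(len(rows))  # 95
```

With 100 values, a window width of 5 and a horizon of 1, this yields 95 rows, matching the first and last rows of the table above.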

On this new data set, you can simply use any regression learning technique you want. Together with strong learners like SVM, this method often clearly outperforms classical methods like ARMA / ARIMA and delivers better and more robust results than neural networks for time series predictions. And with RapidMiner, you can easily apply all preprocessing techniques, extract features, create preprocessing models and perform fair evaluations (backtesting!).

More information about this windowing approach can be found in the master's thesis of Stefan Rüping (in German only, sorry):

http://www-ai.cs.uni-dortmund.de/auto?self=$Publication_1048264721699

and I think also in this paper:

http://www-ai.cs.uni-dortmund.de/auto?self=$Publication_1059736767197


Of course you could get a lot more information about univariate and multivariate (time) series predictions in our training courses ;-)

Cheers,
Ingo

Braulio Medina
Guest
« Reply #3 on: May 25, 2008, 04:12:28 AM »

Hello Ingo,

Would you suggest any special technique for predicting stock price time series?

I am absolutely sure that you approach this issue in a training course at Rapid-i. Unfortunately, I won't be able to fly to Europe until the end of the year. I am a Brazilian mathematician/entrepreneur looking forward to beating the market using time series prediction ;-)

Hope to meet your team in the future,

Viele Grüsse aus dem sonnigen Rio,

Braulio
marcel
Guest
« Reply #4 on: May 26, 2008, 10:00:33 AM »

Hello,

Do you have training facilities in France? It is somewhat difficult to come over to Dortmund! On the other hand, your answers on this forum are so good that I will pose another question: how do I put the records in the input file? Most recent date first, or oldest date first? Does it matter in which order I put the records in the time series input file?

Thanks a lot
Ingo Mierswa
« Reply #5 on: May 26, 2008, 05:47:38 PM »

Hi Braulio,

Quote from Braulio Medina:
Would you suggest any special technique for predicting stock price time series?

For stock prices, basically the same applies as for other time series predictions. However, the amount of history which should be regarded differs a lot, for example, between currency trading and actual stock predictions. In addition to the mere historical data, I would also recommend extracting additional features (indicators) from your series data which could be used to support the predictions. That said, you should take into account not only the usefulness of the indicators but also the fact that others calculate and use them as well. In the past, we also combined the single stock values and extracted indicators with macroeconomic features.

The basic idea is always the same: use windowing to transform series data into features describing the history for the current time point (by using the windowing approach described above). Then add indicators (for example with the operators from the value series plugin), extracted either from the complete series so far, or only from the regarded history, or from any other time window, and join them to the windowed data (e.g. with the ExampleSetJoin operator). Then add additional features describing economics etc. (again with the ExampleSetJoin). Learn the prediction model from the complete aggregated feature set and predict either the actual value or triggers for buying or selling. And you will see: sometimes you beat the markets, sometimes you won't. So it might be important to optimize the model and the preprocessing taking the costs into account. For the latter, the support by RapidMiner could be better, but it would be quite easy to develop such a cost evaluation operator yourself (or have it developed).
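
As a rough sketch of "window first, then add indicators", the following Python snippet appends two simple indicators computed from the regarded history to each windowed row. The choice of indicators (moving average and momentum) and all names are assumptions for illustration only; they are not the operators of the value series plugin.

```python
def window_with_indicators(values, window_size=5, horizon=1):
    """Windowed history plus two illustrative indicators per row."""
    rows = []
    for start in range(len(values) - window_size - horizon + 1):
        history = values[start:start + window_size]
        label = values[start + window_size + horizon - 1]
        moving_average = sum(history) / window_size  # level of the window
        momentum = history[-1] - history[0]          # change across the window
        rows.append((history + [moving_average, momentum], label))
    return rows


prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 15.0]  # made-up toy values
rows = window_with_indicators(prices)
for features, label in rows:
    print(features, label)
```

Each row now has 7 features (5 history values plus 2 indicators); further columns such as economic features would be joined in the same way before learning.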

Your German is great by the way! Of course this would be slightly less impressive if you came from here (or lived here for some time ;-). I wish I could send you sunny greetings as well, but actually it's raining cats and dogs right now...

Maybe we will meet later this year. I would be looking forward to that.

Cheers,
Ingo
Ingo Mierswa
« Reply #6 on: May 26, 2008, 06:05:03 PM »

Hello Marcel,

Quote from marcel:
Do you have training facilities in France? It is somewhat difficult to come over to Dortmund! On the other hand, your answers on this forum are so good that I will pose another question: how do I put the records in the input file? Most recent date first, or oldest date first? Does it matter in which order I put the records in the time series input file?

No, unfortunately we do not have a training facility in France yet. Maybe it is easier for you to come to London? We will provide training courses in London between September 09 and September 12, 2008.

http://rapid-i.com/content/view/110/125/
http://rapid-i.com/content/view/111/126/

It is the oldest date first if you follow the format described above, i.e.

v1
v2
...
v100
...

Hope that helps,
Ingo
deonmfe
Newbie
« Reply #7 on: May 27, 2008, 06:09:07 AM »

Hi,

I am also interested in predicting time-series values using the windowing process. I am, however, a bit stuck at the preprocessing stage. If I have too many variables, what would be the best preprocessing algorithm or process to use?
By the way, thanks for the great replies.

Deon
Ingo Mierswa
« Reply #8 on: May 27, 2008, 11:25:44 AM »

Hi Deon,

if you have more than one variable, the operator "MultivariateSeries2WindowExamples" has to be used instead of the "Series2WindowExamples" which only works for windowing a univariate (i.e. single attribute) time series. The basic approach is the same but you have to define which of the columns should actually be predicted.

So let's say you have three columns with series data with 100 time points like here:

u1   v1   w1
u2   v2   w2
u3   v3   w3
u4   v4   w4
u5   v5   w5
u6   v6   w6
...
u100  v100  w100


The task now again is to learn from the past to predict a value sometime in the future - but now the history of all columns should be taken into account, not only that of a single one. For solving this task, we will employ a windowing approach as described above for the univariate case. Let's say you want to predict the value for the middle (v) column; then the result of the windowing (window width 5, step size 1, horizon 1) will look like:

Code:
att1  att2  att3  att4  att5  att6  att7  att8  att9  att10 att11 att12 att13 att14 att15 label
--------------------------------
u1    u2    u3    u4    u5    v1    v2    v3    v4    v5    w1    w2    w3    w4    w5    v6
u2    u3    u4    u5    u6    v2    v3    v4    v5    v6    w2    w3    w4    w5    w6    v7

The first five attributes are the history of "u", the next five the history of "v" and the last five the history of "w", all of which are taken into account as attributes / variables / features to learn from. The label is the value which should be predicted. It is simply the next value after the last value of the window of the dimension to predict (since we chose a horizon of 1).
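
The multivariate case can be sketched in Python in the same way as the univariate one. Again, this only mirrors the table above and is not the implementation of MultivariateSeries2WindowExamples; the function name is hypothetical, and `label_dimension` is used here simply as the zero-based index of the column to predict.

```python
def window_multivariate(columns, label_dimension, window_size=5, horizon=1):
    """columns: equally long series; the label comes from columns[label_dimension]."""
    n = len(columns[0])
    rows = []
    for start in range(n - window_size - horizon + 1):
        features = []
        for column in columns:
            # Append the history of each column, one column after the other.
            features.extend(column[start:start + window_size])
        label = columns[label_dimension][start + window_size + horizon - 1]
        rows.append((features, label))
    return rows


u = [f"u{i}" for i in range(1, 101)]
v = [f"v{i}" for i in range(1, 101)]
w = [f"w{i}" for i in range(1, 101)]
rows = window_multivariate([u, v, w], label_dimension=1)
print(rows[0])
# (['u1', 'u2', 'u3', 'u4', 'u5', 'v1', 'v2', 'v3', 'v4', 'v5',
#   'w1', 'w2', 'w3', 'w4', 'w5'], 'v6')
```

Three columns with a window width of 5 give 15 attributes per row, plus the label from the "v" column.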

In addition, you could of course also merge other descriptive attributes with those by using the corresponding operators.

Cheers,
Ingo
marcel
Guest
« Reply #9 on: May 27, 2008, 05:22:41 PM »

Hello Ingo,

First of all thanks a lot for your answers which are truly helping us further.... in very small steps... We are trying to implement the MultivariateSeries2WindowExamples operator, following your example of three columns (window width 5, step size 1, horizon 1). Our settings are the following:   

<operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
    <parameter key="create_single_attributes" value="false"/>
    <parameter key="label_dimension" value="1"/>
    <parameter key="series_representation" value="encode_series_by_examples"/>
    <parameter key="window_size" value="5"/>
</operator>

Problem: The result file doesn't give us back a window for all 3 variables like in your example. It only provides a window for the last variable/column in the data file. It seems that we didn't properly initialise the "Multivariate" function, or we are still in univariate mode. A second question: is it possible to enrich the initial data file with the newly created window information? Is there something like a join operator? I guess so...

Danke sehr!
Ingo Mierswa
« Reply #10 on: May 27, 2008, 07:44:04 PM »

Hello,

here is a small example:

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="DataGeneration" class="OperatorChain" breakpoints="after" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_of_attributes" value="3"/>
            <parameter key="target_function" value="random"/>
        </operator>
        <operator name="FeatureNameFilter" class="FeatureNameFilter">
            <parameter key="filter_special_features" value="true"/>
            <parameter key="skip_features_with_name" value="label"/>
        </operator>
        <operator name="ChangeAttributeNameU" class="ChangeAttributeName">
            <parameter key="new_name" value="u"/>
            <parameter key="old_name" value="att1"/>
        </operator>
        <operator name="ChangeAttributeNameV" class="ChangeAttributeName">
            <parameter key="new_name" value="v"/>
            <parameter key="old_name" value="att2"/>
        </operator>
        <operator name="ChangeAttributeNameW" class="ChangeAttributeName">
            <parameter key="new_name" value="w"/>
            <parameter key="old_name" value="att3"/>
        </operator>
    </operator>
    <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
        <parameter key="label_dimension" value="1"/>
        <parameter key="series_representation" value="encode_series_by_examples"/>
        <parameter key="window_size" value="5"/>
    </operator>
</operator>


Please note that the DataGeneration just produces random data (so don't expect to learn good models from this data set). The data set has three columns (I renamed them to u, v, and w) containing the series values. We have 100 points in time (encoded by the examples, hence the setting in the windowing operator).

After the breakpoint is reached, you will see the data set. After resuming the process, the result is a windowed data set containing 15 new attributes (named "Series0" to "Series14" - we should rename those...) and a label taken from the "v" column (the column with index 1).

From this windowed data set we can now learn an arbitrary regression model as described above.

Hope that helps,
Ingo
deonmfe
« Reply #11 on: May 28, 2008, 01:28:19 AM »

Hi Ingo,
Thanks for the reply, it was quite helpful. I have two more questions I would like to throw your way and would greatly appreciate your response.

I set up the process and everything works fine, except that I run out of memory (stack overflow) using the MultiLayerPerceptron as a learner because I start off with too many variables (89). So here is my first question:

Before employing the MultivariateSeries2WindowExamples operator, which algorithm should I use to decrease the number of initial variables according to intercorrelations and covariances?

Now for my second question:

In the process you posted above, you added an operator "FeatureNameFilter". Could you please explain why you need to filter the name "label"? Is it because you do not want the computer to "see" the variable it is supposed to predict? Forgive me if I've got the cat by the tail, I'm a relative newbie to this field of work.

Thanks a lot again for the speedy replies.

Deon
Ingo Mierswa
« Reply #12 on: May 28, 2008, 07:03:05 AM »

Hi Deon,

You can use the different feature / attribute weighting operators available, or one of the operators CovarianceMatrix (which is not able to produce feature weights) or CorrelationMatrix (which is also able to produce feature weights with a certain setting), and use a weight-based selection operator such as AttributeWeightSelection right after that. This might look like:

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" breakpoints="after">
        <parameter key="number_of_attributes" value="89"/>
        <parameter key="target_function" value="random"/>
    </operator>
    <operator name="CorrelationMatrix" class="CorrelationMatrix">
        <parameter key="create_weights" value="true"/>
        <parameter key="squared_correlation" value="true"/>
    </operator>
    <operator name="AttributeWeightSelection" class="AttributeWeightSelection" breakpoints="after">
        <parameter key="weight" value="0.9"/>
    </operator>
    <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
        <parameter key="label_dimension" value="0"/>
        <parameter key="series_representation" value="encode_series_by_examples"/>
        <parameter key="window_size" value="5"/>
    </operator>
</operator>
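
To illustrate the general idea of reducing attributes by intercorrelation, here is a small Python sketch. It is explicitly not a description of what CorrelationMatrix and AttributeWeightSelection compute internally: it is a simple greedy filter, assumed for illustration, that drops every column whose squared Pearson correlation with an already kept column exceeds a threshold.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, max_squared_correlation=0.9):
    """columns: dict name -> values. Keeps the first column of each correlated group."""
    kept = []
    for name, values in columns.items():
        if all(pearson(values, columns[k]) ** 2 <= max_squared_correlation for k in kept):
            kept.append(name)
    return kept


data = {
    "u": [1.0, 2.0, 3.0, 4.0, 5.0],
    "u_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with u
    "w": [2.0, -1.0, 0.0, 1.0, -2.0],
}
print(drop_redundant(data))  # ['u', 'w']
```

Running such a reduction before the windowing keeps the attribute count small before it gets multiplied by the window width.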

Another idea might be to reduce the window width, since a wider window further increases the number of resulting attributes. The total number of attributes after windowing is number_old_attributes * window_width; with 89 attributes and a window width of 5, that is already 445 attributes.

About the FeatureNameFilter: I simply used it to create a data set with exactly the same "look and feel" as the one we discussed here. The windowing operator creates a label on its own by using the values after the horizon from the specified label column, so I don't need an extra label here. So no magic here ;-)

Cheers,
Ingo
marcel
Guest
« Reply #13 on: May 28, 2008, 02:28:47 PM »

Hello Ingo,

We've got the multivariate operator running as shown in your example. It is not easy, but every day we are getting closer to our goal, which is to predict traffic usage data for the next day(s). In order to do so, we are experimenting with the MLP operator because of the numeric attributes. But when we use the network on the windowed usage data (5 days) and try to predict the next day's figure, the prediction is quite the opposite of what we expect it to be. In fact, it seems that the network isn't predicting the value for the next day (date+1) but simply repeats yesterday's value (date-1). Perhaps because the correlation is the biggest at this point? Do you have any idea what we are doing wrong, or did we create a NN that predicts the past...? Here are our settings for the MultivariateSeries2WindowExamples operator:

        <operator name="MultivariateSeries2WindowExamples" class="MultivariateSeries2WindowExamples">
            <parameter key="label_dimension"   value="0"/>
            <parameter key="series_representation"   value="encode_series_by_examples"/>
            <parameter key="window_size"   value="5"/>
        </operator>
Ingo Mierswa
« Reply #14 on: May 28, 2008, 07:15:22 PM »

Hi Marcel,

this is where the "fun" begins...

If the model is not able to get the underlying process right, what would probably be the best prediction? Right, the last known one. So I would personally try to

- change the learner (by the way: neural nets are not really known to work well on high-dimensional data), for example, try SVM or other linear and non-linear regression schemes
- change the learning parameters, for example the kernel parameters or the failure costs of a Support Vector Machine
- change the windowing parameters, e.g. try different amounts of history (window widths)

You could of course optimize the structure and / or the parameters automatically by using the appropriate parameter optimization operators.
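
One way to check whether a model really just repeats the last known value is to compare it against that "persistence" baseline directly. The following Python sketch assumes you have the windowed rows and the model's predictions as plain lists; all names and the toy numbers are made up for illustration.

```python
def mean_absolute_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)


def compare_with_persistence(rows, predictions):
    """rows: (history, label) pairs; predictions: model outputs for the same rows."""
    labels = [label for _, label in rows]
    last_values = [history[-1] for history, _ in rows]
    return {
        "model_error": mean_absolute_error(predictions, labels),
        "persistence_error": mean_absolute_error(last_values, labels),
        # Close to zero means the model is effectively repeating the last value.
        "distance_to_last_value": mean_absolute_error(predictions, last_values),
    }


rows = [([1.0, 2.0, 3.0], 4.0), ([2.0, 3.0, 4.0], 6.0), ([3.0, 4.0, 6.0], 5.0)]
predictions = [3.0, 4.0, 6.0]  # a "model" that only repeats the last known value
report = compare_with_persistence(rows, predictions)
print(report)
```

If the model error is not clearly below the persistence error after changing the learner, its parameters, or the window width, the model has not learned more than "tomorrow equals today".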

Cheers,
Ingo