Pages: [1]
 Author Topic: [Solved] Another kind of performance measurement for time series  (Read 2466 times)
qwertz
Full Member

Posts: 131

 « on: March 14, 2013, 01:55:52 PM »

Especially in financial data mining, one would build a model not on the actual stock price but on the day-over-day difference.
Consequently, the result of the prediction process is an estimate of how the price will change from one day to the next.

The currently available "forecasting performance" operator for series determines whether the prediction trend is correct.
(e.g. delta[today] = 4; delta[prediction for tomorrow] = 6; delta[tomorrow] = 5 >> trend is true because tomorrow>today AND prediction>today)

In order to determine win/loss this is not sufficient.
(e.g. delta[today] = -4; delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend is true but the share still loses value)

Hence, the main question is rather whether delta[tomorrow] will be positive or negative.
(e.g. delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend should be true because prediction and tomorrow have the same sign)
(e.g. delta[prediction for tomorrow] = 4; delta[tomorrow] = -1 >> trend should be false)
(e.g. delta[prediction for tomorrow] = 1; delta[tomorrow] = 3 >> trend should be true)

Can anyone help how to realize this kind of performance measurement?
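For concreteness, the desired measure could be sketched like this (an illustrative Python sketch, not an existing RapidMiner operator; the function name is made up): a prediction counts as a hit when the predicted delta and the actual delta share the same sign.

```python
# Illustrative sketch: a prediction counts as a hit when the predicted
# delta and the actual delta share the same sign, i.e. when their
# product is non-negative.

def sign_agreement_accuracy(predicted, actual):
    """Fraction of days where predicted and actual deltas agree in sign."""
    assert len(predicted) == len(actual)
    hits = sum(1 for p, a in zip(predicted, actual) if p * a >= 0)
    return hits / len(predicted)

# The three examples from above:
print(sign_agreement_accuracy([-3], [-2]))  # same sign -> 1.0
print(sign_agreement_accuracy([4], [-1]))   # opposite signs -> 0.0
print(sign_agreement_accuracy([1], [3]))    # same sign -> 1.0
```

Note that a delta of exactly zero is counted as a hit here; how to treat flat days is a design choice.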

PS: With the existing operator I observed pretty good prediction trend accuracy rates of 0.7 to 0.8, but the overall win/loss simulation was only slightly above 0.5 due to the issue described above. So I was wondering whether different data preprocessing could help (e.g. transforming the stock values into binomial data like "up" and "down", but SVMs are not able to handle binomial data). So far I calculate the daily percentage change for all attributes and the label. The best correlating attributes are then used to build a model with the SVM. Does anyone happen to know whether there are other essential preprocessing steps to improve prediction quality?

Thank you for your help!

Kind regards
Sachs
 « Last Edit: June 10, 2013, 07:46:51 PM by qwertz » Logged
Marius Helf
Hero Member

Posts: 1805

 « Reply #1 on: April 08, 2013, 01:34:26 PM »

Hi,

You could probably use a combination of Generate Attributes and Aggregate to calculate any desired performance measure. Of course those operators work on example sets and write their results into an example set, but once you have the final value you can extract it as a performance measure with the Extract Performance operator, with performance_type set to data_value.

Hope this helps!

Best regards,
Marius
 Logged

Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
wessel
Hero Member

Posts: 564

 « Reply #2 on: April 08, 2013, 02:01:39 PM »

Use the script operator?

Alternatively, convert your data to differences each day, so the data points are actual deltas computed in advance?
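That differencing step, sketched in Python for illustration (the price values are hypothetical):

```python
# Convert a raw price series into day-over-day deltas up front,
# so every data point is already a difference.
prices = [100.0, 104.0, 101.0, 103.0]  # hypothetical closing prices
deltas = [b - a for a, b in zip(prices, prices[1:])]
print(deltas)  # [4.0, -3.0, 2.0]
```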
 Logged
qwertz
Full Member

Posts: 131

 « Reply #3 on: April 08, 2013, 10:02:12 PM »

Thank you for your replies!

I am not familiar with the script operator yet - so I am going to try the combination of Generate Attributes and Aggregate first.

I don't get the second part about converting to differences each day. The input data is already the difference. But even if the predicted trend is positive, that doesn't necessarily mean an absolute positive result: today's difference could be e.g. -5 while the prediction is -3. So the trend is up, but the overall result is still negative.

Best regards
Sachs
 Logged
wessel
Hero Member

Posts: 564

 « Reply #4 on: April 08, 2013, 11:15:11 PM »

Write me some pseudo code and I can write you the Script operator code.

There is an operator called "Predict Series".

It gives you "real" and "predicted" values for your label attribute.

So you have 2 arrays with N data points:

real.length() == N and predicted.length() == N

Can you write pseudo code with these arrays?

 Logged
qwertz
Full Member

Posts: 131

 « Reply #5 on: April 09, 2013, 01:11:43 AM »

Sorry, I think I don't get what you are saying. The main problem is about the evaluation. So far I used the "Forecasting Performance" operator for that.

"Predict Series" unfortunately won't work with my example because this operator requires univariate data. However, I believe that the prediction part of the model already works fine.

Best regards
Sachs
 Logged
wessel
Hero Member

Posts: 564

 « Reply #6 on: April 09, 2013, 01:36:00 AM »

You would normally run Predict Series and then use the Script operator to manually do something very similar to Forecasting Performance.

As far as I'm aware, you can run Predict Series and then implement a script that does something like:
(e.g. delta[prediction for tomorrow] = -3; delta[tomorrow] = -2 >> trend should be true because prediction and tomorrow have the same sign)
(e.g. delta[prediction for tomorrow] = 4; delta[tomorrow] = -1 >> trend should be false)
(e.g. delta[prediction for tomorrow] = 1; delta[tomorrow] = 3 >> trend should be true)

But maybe I misunderstood from the beginning; in that case I'm sorry.

Best regards,

Wessel
 Logged
qwertz
Full Member

Posts: 131

 « Reply #7 on: May 13, 2013, 04:54:07 AM »

Hi Wessel,

Sorry that it took me so long to respond to your kind offer. The long delay doesn't mean it is any less important to me; I have been travelling for several months with very limited internet access.

I am going to write some pseudo code and post it in the next few days.

Thank you very much!
Sachs
 Logged
wessel
Hero Member

Posts: 564

 « Reply #8 on: May 13, 2013, 10:06:29 PM »

I'm still here
 Logged
qwertz
Full Member

Posts: 131

 « Reply #9 on: May 16, 2013, 05:19:46 AM »

After taking some time to think about the pseudo code, it turned out that the formula is similar to that of prediction trend accuracy (PTA).

PTA is described in http://rapid-i.com/api/rapidminer-4.6/com/rapidminer/operator/performance/PredictionTrendAccuracy.html

Quote
Measures the number of times a regression prediction correctly determines the trend. This performance measure assumes that the attributes of each example represents the values of a time window, the label is a value after a certain horizon which should be predicted. All examples build a consecutive series description, i.e. the labels of all examples build the series itself (this is, for example, the case for a windowing step size of 1). This format will be delivered by the Series2ExampleSet operators provided by RapidMiner.

Example: Lets think of a series v1...v10 and a sliding window with window width 3, step size 1 and prediction horizon 1. The resulting example set is then

T1 T2 T3 L P
---------------
v1 v2 v3 v4 p1
v2 v3 v4 v5 p2
v3 v4 v5 v6 p3
v4 v5 v6 v7 p4
v5 v6 v7 v8 p5
v6 v7 v8 v9 p6
v7 v8 v9 v10 p7

The second last column (L) corresponds to the label, i.e. the value which should be predicted and the last column (P) corresponds to the predictions. The columns T1, T2, and T3 correspond to the regular attributes, i.e. the points which should be used as learning input.

This performance measure then calculates the actual trend between the last time point in the series (T3 here) and the actual label (L) and compares it to the trend between T3 and the prediction (P), sums the products between both trends, and divides this sum by the total number of examples, i.e. [(if ((v4-v3)*(p1-v3)>=0), 1, 0) + (if ((v5-v4)*(p2-v4)>=0), 1, 0) +...] / 7 in this example.

In contrast to PTA, I need a formula which calculates [(if ((v4)*(p1)>=0), 1, 0) + (if ((v5)*(p2)>=0), 1, 0) +...] / 7
In other words: the subtraction is left out.
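The two measures could be contrasted in a short sketch (illustrative Python, not RapidMiner code; the variable names are made up, corresponding to the T3, L and P columns of the table above):

```python
# Contrast between the built-in prediction trend accuracy (PTA) and
# the requested sign-only variant.
# last_window: last time point of each window (T3: v3..v9)
# label:       actual value to predict        (L:  v4..v10)
# pred:        model prediction               (P:  p1..p7)

def pta(last_window, label, pred):
    """Hit when (label - last) and (pred - last) point the same way."""
    hits = sum(1 for t, l, p in zip(last_window, label, pred)
               if (l - t) * (p - t) >= 0)
    return hits / len(label)

def sign_only(label, pred):
    """Subtraction left out: hit when label and prediction share a sign."""
    hits = sum(1 for l, p in zip(label, pred) if l * p >= 0)
    return hits / len(label)

# The losing-trade example from the first post: delta today = -4,
# prediction = -3, actual = -2. Both measures count this as correct
# (a loss was predicted and a loss occurred):
print(pta([-4], [-2], [-3]))  # 1.0
print(sign_only([-2], [-3]))  # 1.0
# A predicted gain followed by an actual loss is a sign_only miss:
print(sign_only([-1], [4]))   # 0.0
```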

I would appreciate your help very much!!

Kind regards
Sachs
 Logged
qwertz
Full Member

Posts: 131

 « Reply #10 on: June 10, 2013, 07:45:28 PM »

Thanks to Wessel, here is the piece of code which computes almost any performance measure one could imagine.

Cheers
Sachs

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data" compatibility="5.3.008" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
      <operator activated="true" class="subprocess" compatibility="5.3.008" expanded="true" height="76" name="Subprocess" width="90" x="179" y="30">
        <process expanded="true">
          <operator activated="true" class="generate_attributes" compatibility="5.3.008" expanded="true" height="76" name="Generate Attributes" width="90" x="45" y="30">
            <list key="function_descriptions">
              <parameter key="new_performance" value="1*2"/>
            </list>
          </operator>
          <operator activated="true" class="extract_performance" compatibility="5.3.008" expanded="true" height="76" name="Performance" width="90" x="180" y="30">
            <parameter key="performance_type" value="statistics"/>
            <parameter key="attribute_name" value="new_performance"/>
          </operator>
          <connect from_port="in 1" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Performance" to_port="example set"/>
          <connect from_op="Performance" from_port="performance" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Subprocess" to_port="in 1"/>
      <connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
 Logged