Author Topic: Balanced sampling for network training?  (Read 2704 times)
chaosbringer
Newbie
*
Posts: 21


« on: January 12, 2011, 04:35:27 PM »

Hi,
I have a very imbalanced sample set, e.g. 99% true and 1% false. Is it reasonable to select a balanced subset with a 50/50 distribution for neural network training? My reasoning is that training on the original dataset may bias the network towards the true samples.
Can you suggest some literature that covers this topic, especially for neural networks?

Thank you very much,
chaosbringer
steffen
Sr. Member
****
Posts: 376



« Reply #1 on: January 14, 2011, 08:49:33 AM »

Hi chaosbringer

I recommend to ask the question on http://stats.stackexchange.com/

Although we try to answer general data mining questions here, the number of experts with spare time is quite low. Nevertheless, it would be great if you posted a link to your question here (if you are going to ask there).

greetings,

steffen


"I want to make computers do what I mean instead of what I say"
Read The Fantastic Manual
spitfire_ch
Newbie
*
Posts: 39


« Reply #2 on: January 15, 2011, 07:43:24 PM »

I just stumbled over the answer to this question on said site:

http://stats.stackexchange.com/questions/6254/balanced-sampling-for-network-training

Dikran Marsupial:   
Yes, it is reasonable to select a balanced dataset; however, if you do, your model will probably over-predict the minority class in operation (or on the test set). This is easily overcome by using a threshold probability other than 0.5. The best way to choose the new threshold is to optimise it on a validation sample that has the same class frequencies as those encountered in operation (or in the test set).
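To make that threshold-tuning step concrete, here is a minimal Python sketch (the function names and toy data are illustrative, not from the thread): it scans candidate thresholds on a validation set that has the operational class frequencies and keeps the one that maximises balanced accuracy.

```python
def balanced_accuracy(labels, probs, threshold):
    """Mean of the per-class accuracies at the given threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def best_threshold(labels, probs, candidates=None):
    """Scan candidate thresholds and return the one maximising balanced accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: balanced_accuracy(labels, probs, t))

# Toy validation set with the operational ~99:1 imbalance. A model trained
# on a balanced sample tends to output inflated probabilities for the rare
# class, so the optimal threshold ends up above 0.5.
val_labels = [1] * 5 + [0] * 495
val_probs  = [0.9, 0.85, 0.8, 0.75, 0.6] + [0.55] * 5 + [0.2] * 490
t = best_threshold(val_labels, val_probs)
```

On this toy data the chosen threshold lies above 0.5, which is exactly the correction the answer describes.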

Rather than re-sample the data, a better approach would be to give different weights to the positive and negative examples in the training criterion. This has the advantage that you use all of the available training data. The reason that a class imbalance leads to difficulties is not the imbalance per se; it is more that you just don't have enough examples from the minority class to adequately represent its underlying distribution. Therefore, if you resample rather than re-weight, you are solving the problem by making the distribution of the majority class badly represented as well.
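A minimal sketch of that re-weighting idea (function names are mine, not from the thread): each example's contribution to the loss is scaled inversely to its class frequency, so both classes pull equally on the training criterion while all of the data is kept.

```python
import math

def class_weights(labels):
    """w_c = N / (2 * N_c): the rare class gets a proportionally larger weight."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}

def weighted_cross_entropy(labels, probs, weights):
    """Mean class-weighted binary cross-entropy over the full (imbalanced) set."""
    total = 0.0
    for y, p in zip(labels, probs):
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / len(labels)

# With a 99:1 imbalance, each of the 10 positives counts roughly 100x as
# much as a single negative, so the total weight per class is equal.
labels = [1] * 10 + [0] * 990
w = class_weights(labels)
```

With these weights, the summed weight of the positive examples equals that of the negatives, which is the balancing effect resampling tries to achieve, without discarding any majority-class data.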

Some may advise simply using a different threshold rather than reweighting or resampling. The problem with that approach is that with an ANN the hidden layer units are optimised to minimise the training criterion, but the training criterion (e.g. sum-of-squares or cross-entropy) depends on the behaviour of the model away from the decision boundary, rather than only near it. As a result, hidden layer units may be assigned to tasks that reduce the value of the training criterion but do not help in accurate classification. Using re-weighted training patterns helps here, as it tends to focus attention more on the decision boundary, so the allocation of hidden layer resources may be better.

For references, a Google Scholar search for "Nitesh Chawla" would be a good start; he has done a fair amount of very solid work on this.


haddock
Hero Member
*****
Posts: 853



« Reply #3 on: January 16, 2011, 10:52:57 AM »

There may be other answers as well...

http://www.google.fr/search?q=imbalanced+neural+network

Now all we have to do is work out the right one, or whether there can be a right one.

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
chaosbringer
Newbie
*
Posts: 21


« Reply #4 on: January 31, 2011, 07:21:33 PM »

Hi,
if I understand the post on stackexchange.com correctly, it suggests weighting the samples. I think the operator for this task in RapidMiner is "Generate Weights (Stratified)".
However, is there a way to do the weighting if the label is numeric? Is this the purpose of the operator "Generate Weight (LPR)"? I don't really understand its use from the description.

Thank you very much.
Sebastian Land
Administrator
Hero Member
*****
Posts: 2426


« Reply #5 on: February 10, 2011, 11:58:07 AM »

Hi,
if your label is numeric, you don't have a classification task, and hence no classes and no class imbalance.

If you have true and false, you have no numbers. If true and false are encoded by numbers, you will need to turn the attribute into a nominal one by applying Numerical to Binominal.
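As a rough illustration of what such a conversion does (a simplified sketch of my own, not RapidMiner's actual implementation): values falling inside a [min, max] band, 0.0 to 0.0 by default, map to "false", and everything else maps to "true".

```python
def numerical_to_binominal(values, lo=0.0, hi=0.0):
    """Map numeric values to nominal true/false: inside [lo, hi] -> false."""
    return ["false" if lo <= v <= hi else "true" for v in values]
```

So a 0/1-encoded numeric label would come out as the two nominal classes the classification operators expect.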

Greetings,
  Sebastian
rakirk
Newbie
*
Posts: 31


« Reply #6 on: February 17, 2011, 06:11:25 AM »

Weighting could help decrease the error rates. I'd be curious to see what you found. I have typically used ~2/3 of the data for training.

A difference matrix may also be useful for preprocessing the data. Dimensionality reduction may increase the ability to discriminate between true and false.

A couple ideas- hopefully you find something that works.