dan_


« on: July 24, 2011, 03:35:44 PM » 

Hi,
An option to compute 1 − p_value as the weight for an attribute in the above operator, as an alternative to the weight given by the chi-square statistic for the same attribute, would be very useful. A button to choose between 1 − p_value and the statistic itself, for all the input attributes, would be ideal.
With this facility, one can select the attributes for which there is statistical evidence that they are not independent of the label attribute. One would choose 1 − p_value as the weight per attribute in the above operator, and then select all the attributes whose weight is at least 0.95.
Moreover, this facility would give a clear, statistically supported indication of whether or not the input attributes are likely to have predictive power with respect to the label attribute. For example, if all the input attribute weights (calculated as 1 − p values, that is, as complements of p values) were under, let us say, 0.4 in a dataset, then the classification models one would try to build would likely perform poorly, since the data is consistent with the hypothesis that the input attributes are independent of the label attribute. This cannot be said based on the chi-square statistic values alone. These need to be converted into p values first (or, as suggested, into complements of p values) for more insight into the dataset to mine.
So the weights computed as complements of the p values from Pearson's chi-square test can in many cases signal that a dataset is inappropriate for a given classification problem (saving the time spent building various poorly performing models in an attempt to find a good one that is likely not to exist). When the dataset is appropriate, these weights can single out the attributes for which there is statistical evidence that they are not independent of the label attribute (corresponding to large complements of p values), so that they can be used in building the model. Moreover, sorting attributes by the complements of p values is similar to sorting them by the less meaningful chi-square statistic weights (that is, one can choose the top k attributes as usual, etc.). So why not also compute the weights as the complements of the p values in the Weight by Chi Squared Statistic operator, or simply add a new Weight by Chi Square Complement p Value operator?
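For readers who want to try the idea outside RapidMiner, here is a minimal Python sketch of the proposed weight. It is an illustration only, not the operator's implementation: the names `chi2_sf` and `complement_p_weight` and the toy data are made up for this example, no continuity correction is applied, and the degrees of freedom are assumed to be (distinct attribute values − 1) × (classes − 1).

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum exp(-h) * sum_{i < df/2} h^i / i!
        term = math.exp(-h)
        total = term
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return min(1.0, total)
    # Odd df: start from df = 1 (erfc) and add one term per extra 2 df.
    total = math.erfc(math.sqrt(h))
    term = 2.0 * math.sqrt(h / math.pi) * math.exp(-h)
    for j in range((df - 1) // 2):
        total += term
        term *= h / (j + 1.5)
    return min(1.0, total)


def complement_p_weight(attribute, label):
    """Return (chi-square statistic, 1 - p_value) for one nominal attribute."""
    n = len(attribute)
    joint = Counter(zip(attribute, label))
    attr_totals = Counter(attribute)
    label_totals = Counter(label)
    stat = 0.0
    for a, n_a in attr_totals.items():
        for c, n_c in label_totals.items():
            expected = n_a * n_c / n
            stat += (joint[(a, c)] - expected) ** 2 / expected
    df = (len(attr_totals) - 1) * (len(label_totals) - 1)
    if df == 0:
        # A constant attribute (or label) carries no evidence either way.
        return stat, 0.0
    return stat, 1.0 - chi2_sf(stat, df)


# Toy data (hypothetical): one attribute related to the label, one unrelated.
label = ["pos"] * 30 + ["neg"] * 10 + ["pos"] * 10 + ["neg"] * 30
useful = ["yes"] * 40 + ["no"] * 40
noise = ["a", "b"] * 40

stat_useful, w_useful = complement_p_weight(useful, label)
stat_noise, w_noise = complement_p_weight(noise, label)
print("useful:", stat_useful, w_useful)
print("noise: ", stat_noise, w_noise)
```

With the 0.95 threshold suggested above, `useful` would be kept and `noise` dropped.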
Dan


« Last Edit: August 22, 2011, 02:00:20 PM by dan_ »





haddock


« Reply #1 on: July 27, 2011, 12:39:47 AM » 

Hi Dan,
The Wikipedia entry on p-values, http://en.wikipedia.org/wiki/Pvalue, is quite explicit:
"1 − (p-value) is not the probability of the alternative hypothesis being true."
As the article explains:
"Despite the ubiquity of p-value tests, this particular test for statistical significance has come under heavy criticism due both to its inherent shortcomings and the potential for misinterpretation."
So could you explain a bit further the benefits of the operator you propose, because I'm sure I'm missing something? Many thanks




Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?
T.S.Eliot ~ Choruses from the Rock 1934



dan_


« Reply #2 on: July 27, 2011, 02:47:29 PM » 

Hi,
Reading an introductory statistics book would help to clarify the fundamentals of statistical reasoning for you.
Statisticians use p values rather than their complements (1 − p_value), obviously. One rationale for proposing complements of p values as alternative weights in the mentioned RM operator comes from the following equivalence, which holds for a given number of degrees of freedom of the chi-square distribution: bigger chi-square statistic ⇔ smaller p value ⇔ bigger complement of p value.
Practically speaking, RM already employs the chi-square statistic, whose values are seen as weights that can be used to select the best input attributes in a classification problem. For instance, you may pick the top 10 attributes with the highest chi-square statistic values to use as input attributes in your analysis. Instead of this, you may pick the top 10 attributes with the highest complements of p values. The complements of p values as weights may do a similar (if not a better) job to that of the chi-square statistic weights.
However, p values can do more, as explained initially. For a clear understanding of the earlier explanations of the usefulness of the light extension that was proposed, you may need a proper understanding of the foundations of statistical reasoning, so I advise you to read a good foundation book in statistics first before tackling the subject further.
Finally, to conclude: obviously, 1 − p_value is not the probability that the alternative hypothesis is true. It is the complement of the probability that the chosen statistic (seen as a random variable) is greater than or equal to the value of the statistic computed from the data sample, assuming that the null hypothesis is true. If p_value <= 0.05 (or equivalently the complement of the p_value >= 0.95) then the null hypothesis is rejected (and implicitly the alternative hypothesis is accepted) at the 0.05 level of significance. When the p_value is bigger than the threshold of 0.05 (or equivalently the complement of the p value is smaller than 0.95; 0.4 was chosen as an example value) then this is an indication that the data sample is consistent with the null hypothesis. This situation corresponds to smaller values of the chi-square statistic, or equivalently to smaller weights for the attributes, as computed by RM's mentioned operator. And since you will have done some work in Data Mining, you know that especially when your data has a high dimensionality (and not only then) you would wish to select the attributes with higher weights as computed by this operator, for instance (which correspond to higher complements of p values), to get a good model built by employing only a part of your dataset. [[Here the null hypothesis was: an input attribute is independent w.r.t. the class attribute. The alternative hypothesis was the negation of the null hypothesis.]]
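The qualifier "for a given number of degrees of freedom" matters in practice, because attributes with different numbers of distinct values have different degrees of freedom. The sketch below is only an illustration under stated assumptions: `chi2_sf_even` is a hypothetical helper that uses the closed form valid for even degrees of freedom only, and the statistic values are invented. With a two-class label, df = 2 would correspond to a 3-valued attribute and df = 8 to a 9-valued one.

```python
import math


def chi2_sf_even(x, df):
    """P(X >= x) for a chi-square variable; closed form, EVEN df only."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this helper only handles even df >= 2")
    h = max(x, 0.0) / 2.0
    term = math.exp(-h)
    total = term
    for i in range(1, df // 2):
        term *= h / i
        total += term
    return min(1.0, total)


# Same df: a bigger statistic always gives a bigger complement of the p value.
stats = [1.0, 4.0, 9.0, 16.0]
weights_df4 = [1.0 - chi2_sf_even(s, 4) for s in stats]
same_df_monotone = all(a < b for a, b in zip(weights_df4, weights_df4[1:]))

# Different df: the attribute with the bigger statistic can be the weaker one.
p_small_df = chi2_sf_even(10.0, 2)   # statistic 10 with df = 2
p_large_df = chi2_sf_even(12.0, 8)   # statistic 12 with df = 8
rejected_small_df = p_small_df <= 0.05
rejected_large_df = p_large_df <= 0.05

print(weights_df4, same_df_monotone)
print(p_small_df, rejected_small_df)
print(p_large_df, rejected_large_df)
```

Here the statistic 12 looks better than 10 as a raw weight, yet only the df = 2 attribute passes the 0.05 test, which is the kind of difference the complement-of-p-value weight would expose.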
Dan


« Last Edit: August 12, 2011, 10:12:50 AM by dan_ »





haddock


« Reply #3 on: July 27, 2011, 03:15:54 PM » 

Hi Dan,
A dirty dozen: twelve P-value misconceptions. Goodman S.
Source: Departments of Oncology, Epidemiology, and Biostatistics, Johns Hopkins Schools of Medicine and Public Health, Baltimore, MD, USA. Sgoodman@jhmi.edu
Abstract: The P value is a measure of statistical evidence that appears in virtually all medical research papers. Its interpretation is made extraordinarily difficult because it is not part of any formal system of statistical inference. As a result, the P value's inferential meaning is widely and often wildly misconstrued, a fact that has been pointed out in innumerable papers and books appearing since at least the 1940s. This commentary reviews a dozen of these common misinterpretations and explains why each is wrong. It also reviews the possible consequences of these improper understandings or representations of its meaning. Finally, it contrasts the P value with its Bayesian counterpart, the Bayes' factor, which has virtually all of the desirable properties of an evidential measure that the P value lacks, most notably interpretability. The most serious consequence of this array of P-value misconceptions is the false belief that the probability of a conclusion being in error can be calculated from the data in a single experiment without reference to external evidence or the plausibility of the underlying mechanism.
At least I know how little I know, being rude advertises ignorance.






Ingo Mierswa


« Reply #4 on: July 27, 2011, 09:49:17 PM » 

Hi Dan,
please stay calm and fair: asking that somebody should first acquire basic knowledge before starting a (from my point of view meaningful and useful) discussion with you is certainly not a good style of discussion. Haddock certainly is knowing what he is talking about!
Ok, let's come back to the discussion instead of insulting each other, please...
Cheers, Ingo







dan_


« Reply #5 on: July 30, 2011, 10:51:55 AM » 

Hi,
@Ingo: My posting was limited to arguments on the subject, and was clearly fair. Moreover, I made a pertinent recommendation to someone, in this case Haddock, to read suitable relevant material, since he seemed interested in the subject and seemed to need such a recommendation (no offence). If he finds it inconvenient to be recommended material that would help to improve his own knowledge of the subject, then I can do nothing about it. It is common on this forum to recommend that people follow introductory tutorials or documentation when it seems they would need it and benefit from it. There is nothing rude in this, and I see Haddock himself has made such recommendations to people not familiar with RM or with its documentation, on a number of occasions.
“Haddock certainly is knowing what he is talking about!”
No offence, perhaps he was less convincing on this occasion; at least the question in the posting suggested so. The extension of the RapidMiner operator I suggested was based on a statistical tool which is fundamental for decision tree algorithms such as CHAID and QUEST, whose details Haddock didn’t seem to be aware of. CHAID and QUEST depend heavily on the use of p values, with Pearson’s chi square test in the former and the same test plus ANOVA and Levene F tests in the latter. Moreover, the main ideas in these algorithms include the much simplified idea that the operator extension I suggested is based on. So it is certain that knowing and understanding a bit of the mechanism of CHAID or QUEST would make my proposed operator extension seem trivially clear. However, these algorithms (or the simple extension I proposed) can be understood better only with a good understanding of statistical tests, hence the recommendation I made to consult a good introductory Stats book.
Finally, statistical tests (including p values) are the current standard in Stats, are studied at least by Maths/Stats students in all universities, and are certainly used by two of the major players in commercial Data Mining software, IBM SPSS Modeler and SAS Enterprise Miner (and not only by them); see for instance the implementations of CHAID and QUEST. RapidMiner too uses the chi square statistic (in a limited way, unfortunately), which is an inseparable component of the Pearson’s test mentioned above, as is the concept of the p value. For Haddock and for those interested in details on these algorithms, various good documentation is also available online, including from SAS Institute and IBM SPSS (which also use them in their statistical software). For instance, these show how p values are employed in selecting predictors / input attributes:
http://support.spss.com/productsext/spss/documentation/statistics/algorithms/14.0/TREECHAID.pdf
http://support.spss.com/productsext/spss/documentation/statistics/algorithms/14.0/TREEQUEST.pdf
@Haddock: Childish manner to end a posting (with that emoticon), for a respected veteran of this forum ...
Regarding the paper you quoted above (written by a medical doctor and researcher in oncology, according to his webpage), indeed, it illustrates usual problems researchers in medicine may have with using Statistics properly, in particular statistical tests (and p values as a component concept). These frequent problems of improper use are encountered in other scientific communities whose members are large statistics consumers (e.g. Social Sciences). There has been some controversy regarding the pluses and minuses of statistical tests (and p values), much of it fuelled by poor understanding and improper application of these tools, as illustrated by the paper you cite.
However, statistical tests (and p values) are part of the standard in the field of statistical inference (and these tools are what students in Maths/Stats at Harvard, Cambridge, and everywhere else are currently taught) and remain so, as there is no widely accepted better approach.
I am sure we are all busy with our work and/or study of Data Mining so let’s focus on this and on related subjects, only, on this forum.
Dan


« Last Edit: July 30, 2011, 11:22:38 AM by dan_ »





haddock


« Reply #6 on: July 30, 2011, 12:23:25 PM » 

"Regarding the paper you quoted above (written by a medical doctor and researcher in oncology, according to his webpage), indeed, it illustrates usual problems researchers in medicine may have with using Statistics properly, in particular statistical tests (and p values as a component concept). These frequent problems of improper use are encountered in other scientific communities whose members are large statistics consumers (e.g. Social Sciences)."
Hmmm, so do we believe Dan, or this researcher?
Steven N. Goodman, M.D., M.H.S., Ph.D., is Professor of Oncology in the Division of Biostatistics of the Johns Hopkins Kimmel Cancer Center, with appointments in the Departments of Pediatrics, Biostatistics and Epidemiology in the Johns Hopkins Schools of Medicine and Public Health. Dr. Goodman received a B.A. from Harvard, an M.D. from NYU, trained in Pediatrics at Washington University in St. Louis, received his M.H.S. in Biostatistics, and his Ph.D. in Epidemiology from Johns Hopkins University. He served as co-director of the Johns Hopkins Evidence-Based Practice Center, is on the board of directors of the Society for Clinical Trials, was co-director of the Baltimore Cochrane Center from 1994 to 1998, and is on the core faculties of the Johns Hopkins Berman Bioethics Institute, the Center for Clinical Trials, the Graduate Training Program in Clinical Investigation and the Johns Hopkins Center for the History and Philosophy of Science. He is the editor of Clinical Trials: Journal of the Society for Clinical Trials, has been Statistical Editor of the Annals of Internal Medicine since 1987 and for the Journal of General Internal Medicine from 1999 to 2000. He has served on a wide variety of national panels, including the Institute of Medicine's Committee on Veterans and Agent Orange, and is currently on the IOM Committee on Vaccine Safety, the Medicare Coverage Advisory Commission, and the Surgeon General's committees to write the 2001 and 2002 reports on Smoking and Health. He currently chairs a panel assessing the long-term outcomes of assisted reproductive technologies, established by the Genetics and Public Policy Institute and sponsored by the American Society for Reproductive Medicine and the American Academy of Pediatrics (AAP). He represents the AAP on the Medical Advisory Panel of the National Blue Cross/Blue Shield Technology Evaluation program, and served as a consultant to the President's Advisory Commission on Human Radiation Experiments. He has published over 90 scientific papers, and writes and teaches on evidence evaluation and inferential, methodological, and ethical issues in epidemiology and clinical research.
Tough call!


« Last Edit: July 30, 2011, 01:00:57 PM by haddock »




Ingo Mierswa


« Reply #7 on: July 31, 2011, 11:51:37 AM » 

Ok guys, this ended exactly in a way we were afraid of...
Dan, I do not have any problem with your recommendation to read more material per se. I just meant that starting your first answer to Haddock's reaction with this recommendation might heat the discussion too much. At that point, Haddock had just asked for more information and pointed out the fact that there indeed is some discussion about the usefulness of this measure. I got your point that this question of his actually was the reason for your recommendation, but please understand that Haddock, who just wanted to start a fruitful discussion with you, probably looks for more than the answers a) read more material and b) it's also used by others hence it has to be good.
Haddock, your last answer did also not really help to calm things down. So please: if you have the feeling that the discussion is not giving you the expected information, ask again or simply ignore it.
Ok, back to the original topic: I don't have any problem with the suggested extension at all. It is quite straightforward and might help you and others. One of my major concerns with this statistic is the fact that it only takes into account a single feature at a time and does not look at feature subsets. A single feature might not explain anything, while a combination of two or more features might explain everything. The simplest and most prominent example probably is the XOR function. Hence, the statement "For example if all the input attribute weights (calculated as 1 − p values) were under let us say 0.4 in a dataset, then the classification models one would try to build would likely perform poorly" is not necessarily true. However, this is true for almost all other feature evaluation schemes as well and nevertheless many people (including me) find them useful in certain applications.
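The XOR case can be checked with a few lines of Python. This is only a sketch on an invented, perfectly balanced toy dataset; `chi_square_statistic` is a hypothetical helper computing the plain Pearson statistic from counts, not RapidMiner code.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic for two nominal sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_totals = Counter(xs)
    y_totals = Counter(ys)
    stat = 0.0
    for x, n_x in x_totals.items():
        for y, n_y in y_totals.items():
            expected = n_x * n_y / n
            stat += (joint[(x, y)] - expected) ** 2 / expected
    return stat


# 100 rows: every combination of two binary features, 25 times each.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
a = [r[0] for r in rows]
b = [r[1] for r in rows]
label = [x ^ y for x, y in rows]   # label = a XOR b

stat_a = chi_square_statistic(a, label)         # each feature alone
stat_b = chi_square_statistic(b, label)
stat_pair = chi_square_statistic(rows, label)   # the pair treated as one attribute

print(stat_a, stat_b, stat_pair)
```

Each feature on its own gets a statistic of 0 (p value 1, so a complement weight of 0), while the pair determines the label completely. Any one-attribute-at-a-time weighting would rank both features as useless here.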
In fact, there are literally hundreds of operators I wouldn't use since I believe (by knowledge or experience and sometimes even by prejudice) that there are better alternatives. But at the same time those are exactly the operators used a lot by others and it's good that they are part of RM.
"RapidMiner too uses the chi square statistic (in a limited way, unfortunately)..."
What do you mean by that? That there is some error or that the statistic is not available at other places as well? Do you have some recommendation where you think it is missing in this case?
"I am sure we are all busy with our work and/or study of Data Mining so let’s focus on this and on related subjects, only, on this forum."
Yes, please let's do so!
Cheers, Ingo







dan_


« Reply #8 on: August 09, 2011, 03:10:20 PM » 

Hi Ingo,
Thanks for your comments. This is rather a busy period, but I'll get back to your points.
Thanks, Dan







dan_


« Reply #9 on: August 12, 2011, 11:52:19 AM » 

Hi Ingo,
A few remarks regarding your points.
“... probably looks for more than the answers a) read more material and b) it's also used by others hence it has to be good”
Got your point. Note however that you omitted one essential aspect: there was more than a) and b). Primarily, some technical details and explanations for the main idea (i.e. to use p value complements as weights to select predictive attributes) had been provided in the initial posting (including justifications based on Mathematical Statistics). If the lengthy posting was insufficient, then certainly some introductory Stats reading would have helped, I guess.
Regarding your point b) above, perhaps the best remark I can make here is that most of the software appearing in the upper half of the results of the KDnuggets’ 2010 poll http://www.kdnuggets.com/polls/2010/datamininganalyticstools.html on the use of Data Mining/analytics tools makes use of statistical tests and p values in its algorithms. I refer here to: R, Excel, Statsoft Statistica, SAS (Stats), SAS Enterprise Miner, IBM SPSS Statistics, IBM SPSS Modeler, Matlab, Microsoft SQL Server, Oracle Data Mining, Weka. Notable names from the upper half of the poll results that seem not to make use of p values (yet) include RapidMiner and KNIME. Perhaps I will come back to this list. However, when there is such an omnipresent use of statistical tests and p values, even those of us who do not have a background in Mathematical Statistics and/or Computer Science (in order to better understand them and their use in Data Mining algorithms) are likely to realise that there must be something good about these statistical concepts. Obviously, why p values are good to use has primarily been justified with other arguments than just saying: others use them.
“I don't have any problem with the suggested extension at all. It is quite straightforward and might help you and others.”
Sounds good, thanks.
“... is not necessarily true. However, this is true for almost all other feature evaluation schemes as well and nevertheless many people (including me) find them useful in certain applications. In fact, there are literally hundreds of operators I wouldn't use since I believe (by knowledge or experience and sometimes even by prejudice) that there are better alternatives.”
Obviously in feature selection, the only proven best method is the one in which all subsets of input attributes are evaluated. Since this method is extremely impractical, we come to the use of heuristics (for the non computer scientists on the forum, heuristics are algorithms that provide approximate or partial solutions and are computationally cheaper, so they may be good alternatives to a computationally expensive method that would provide a complete, exact solution). So yes, the heuristic based on the chi-square test considers one attribute at a time, which sometimes may prove to be a disadvantage, although in practice it works very well. But all heuristics have their disadvantages, don’t they, including your preferred ones; that’s why they are just heuristics. Moreover, it is hard to demonstrate that one heuristic in feature selection performs better than another in all circumstances. Method A can work better than method B on one dataset, and worse than method B on another dataset. In particular, I doubt that you could demonstrate that your favourite feature selection heuristic gives a better solution than the chi-square test heuristic on each dataset. That would have been an outstanding research paper, I guess.
In practice, the best approach would be to try some feature selection heuristics and stick to one that works fast enough and provides a good result in that particular problem. I often use the statistical tests (chi-square) to select the best features, and this works very well for most of my problems. In addition to the support it gets from its theoretical foundation, one notable plus of this heuristic is that it is very cheap computationally, so very fast. People interested in details regarding this method may want to have a look at Han’s Data Mining book (for newcomers to the field, this is one of the most used textbooks in Data Mining university courses, and popular among practitioners, both tool users and tool implementers). In the Data Preprocessing chapter, where the chi-square statistical test is presented, it says “the ‘best’ (and ‘worst’) attributes are typically determined using tests of statistical significance”. This obviously refers to the statistical tests; moreover, significance here is related to the so-called significance levels (typically 0.05 or 0.01), which are thresholds for the p value.
In a next posting I will probably refer to another very popular book in (Statistical) Data Mining and Machine Learning, which is, no doubt, known by most of you guys on this forum, especially if you have a Computer Science or Math Stats background: it’s Hastie’s book. That’s an excellent read, and there we can again see statistical tests and p values at work. Hopefully we will finally see p values at work in RM too.
“‘RapidMiner too uses the chi square statistic (in a limited way, unfortunately)...’ What do you mean by that? That there is some error or that the statistic is not available at other places as well? Do you have some recommendation where you think it is missing in this case?”
I mean that in RM the chi square statistic could be better used together with p values in feature weighting / selection as discussed so far, and primarily in the implementation of decision tree algorithms such as CHAID and QUEST, for instance (along with other statistical tests and their own statistics).
Regards, Dan


« Last Edit: August 12, 2011, 12:10:19 PM by dan_ »





wessel


« Reply #10 on: August 12, 2011, 01:26:02 PM » 

Dear All,
Where exactly should RapidMiner display p-values?
In the weights by Chi Squared Statistic? Like for the iris data set, should it have an extra column with the p-value?
a2 0.0
a1 0.28633153931967253
a3 0.8971556723299764
a4 1.0
Best regards,
Wessel







dan_


« Reply #11 on: September 29, 2011, 03:38:08 PM » 

Hi Wessel,
Sorry for my late reply. For the calculation of the p-value one should use non-normalised weights (yours seem to be normalised). In addition, the number of distinct values of the nominal (or discretized numeric) attribute for which we compute the p-value, and the number of classes, need to be taken into account in the calculation. I will post an example.
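To make the two requirements concrete, here is a small Python sketch (an illustration only, not the example promised above). All numbers are invented; `chi2_sf_even` is a hypothetical helper valid for even degrees of freedom only, and the attribute is assumed to have 3 distinct values with 3 classes, giving df = (3 − 1) × (3 − 1) = 4.

```python
import math


def chi2_sf_even(x, df):
    """P(X >= x) for a chi-square variable; closed form, EVEN df only."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this helper only handles even df >= 2")
    h = max(x, 0.0) / 2.0
    term = math.exp(-h)
    total = term
    for i in range(1, df // 2):
        term *= h / i
        total += term
    return min(1.0, total)


def min_max_normalise(weights):
    """Rescale weights to [0, 1], as normalised operator output would look."""
    lo, hi = min(weights), max(weights)
    return [(w - lo) / (hi - lo) for w in weights]


def degrees_of_freedom(n_distinct_values, n_classes):
    return (n_distinct_values - 1) * (n_classes - 1)


# Two hypothetical sets of raw chi-square statistics for three attributes.
raw_weak = [2.0, 6.0, 10.0]
raw_strong = [20.0, 60.0, 100.0]

# After normalisation the two sets are indistinguishable ...
norm_weak = min_max_normalise(raw_weak)
norm_strong = min_max_normalise(raw_strong)

# ... but the p values of the raw statistics are very different.
df = degrees_of_freedom(n_distinct_values=3, n_classes=3)
p_weak = [chi2_sf_even(s, df) for s in raw_weak]
p_strong = [chi2_sf_even(s, df) for s in raw_strong]

print(norm_weak, norm_strong)
print(p_weak)
print(p_strong)
```

The normalised weights carry only the ranking, so a p-value cannot be recovered from them; the raw statistic and the degrees of freedom are both needed.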
Regards, Dan







dan_


« Reply #12 on: November 20, 2012, 02:07:34 PM » 

Hi,
Apologies for not having responded to all the queries asked here; it may have been because of a hectic schedule in that period. I referred, in this topic, to some books that may be useful to anybody looking to get more knowledge and a better understanding of Data Mining. I also promised to recommend some good Data Mining books to one of the users who posted here, Haddock, who, despite showing an authoritative position on this subject, did not seem to have sufficient knowledge of Statistical Data Mining (in particular regarding Data Mining algorithms using p-values). These are among the best Data Mining books, providing solid foundations in the field. Here are the titles. Enjoy!
Best, Dan
Note: All these books describe, among many other popular data mining techniques, also techniques using p-values and/or significance levels (which are threshold values for p-values).
1. Introduction to Data Mining, by Tan, Steinbach and Kumar (Addison Wesley). Courses tackling Statistical Aspects of Data Mining and based on the above book (and the use of R) have been taught at Stanford and disseminated through Google Tech Talks; see recorded sessions at http://www.youtube.com/watch?v=zRsMEl6PHhM
2. Data Mining Concepts and Techniques, by Han, Kamber and Pei (Elsevier). One of the most popular books in university Computer Science courses, and among researchers.
3. Data Mining: Practical Machine Learning Tools and Techniques, by Witten, Frank, and Hall (Elsevier). Again, one of the excellent and most popular books in university Computer Science courses, from the authors who also produced Weka.
4. The Elements of Statistical Learning: Data Mining, Inference and Prediction, by Hastie, Tibshirani and Friedman (Springer). One of the most popular books in university Statistics and Computer Science courses, and one of the most praised by researchers too. A copy can be downloaded for free (great, indeed!) from the authors' webpages at the Dept of Statistics at Stanford University.
5. Data Mining Techniques, by Linoff and Berry (Wiley). One of the most popular books among Data Mining professionals (also used in some university courses), written by very respected guys with long hands-on experience in Data Mining.


« Last Edit: November 20, 2012, 02:29:09 PM by dan_ »





haddock


« Reply #13 on: November 20, 2012, 03:21:31 PM » 







dan_


« Reply #14 on: November 20, 2012, 03:51:25 PM » 

Haddock, you provided a webpage link, but what's your point? What do you want to say? Express yourself please. By the way, have you read any of the books indicated above? As a data miner, it's good to read at least one of these. It would be very beneficial for your general expertise in the field. Especially when you don't have a background in Computer Science (as it may be the case with you), you may need to read a good foundation Data Mining textbook. Anyway, this would be good before expressing yourself authoritatively on this Data Mining forum.


« Last Edit: November 20, 2012, 04:28:35 PM by dan_ »





