[SOLVED] Overlapping folds in cross validation?
siamak_want (Jr. Member, Posts: 87)
« on: July 27, 2012, 09:54:13 AM »
Hi forum,
Today I read many helpful posts about cross validation (X-Validation), but I still have one important question: do the folds that are constructed "overlap" with each other? I mean, do they share any data points, or are they completely separate folds with no overlap?
As you know, RM offers 3 types of cross validation sampling: "linear", "shuffled" and "stratified". I assumed that linear sampling produces non-overlapping folds and that the other two may construct overlapping folds. But I got a very astonishing result: with 10-fold X-Validation and "linear sampling" I got an accuracy of 31%, but when I just switched to "stratified sampling" I got 86% accuracy! I am really confused by these results. Does anyone know how I should evaluate the performance of my model?
I would also really appreciate it if someone could explain the issue of overlapping folds in cross validation from an academic point of view.
Regards,
« Last Edit: August 31, 2012, 09:32:42 AM by Marius »
Marius (Global Moderator, Hero Member, Posts: 1283)
Re: Overlapping folds in cross validation?
« Reply #1 on: August 20, 2012, 09:54:29 AM »
Hi,
The test sets of the folds do NOT overlap; however, the training sets DO overlap. X-Validation splits the data into (e.g.) 10 partitions. It then loops over the partitions, using the current one as the test set and training on the 9 others. Thus any two training sets share 8 of the 10 partitions.
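To make the partition-and-loop scheme concrete, here is a minimal Python sketch. It is not RapidMiner's code; the function name `k_fold_indices` and the toy size of 20 examples are assumptions made up for illustration. It builds the folds from contiguous partitions and checks which sets overlap.

```python
def k_fold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k contiguous partitions and
    return one (train, test) pair of index lists per fold."""
    base, extra = divmod(n_examples, k)
    partitions, start = [], 0
    for i in range(k):
        # The first `extra` partitions get one additional example.
        size = base + (1 if i < extra else 0)
        partitions.append(list(range(start, start + size)))
        start += size

    folds = []
    for i, test in enumerate(partitions):
        # Train on every partition except the one held out for testing.
        train = [idx for j, part in enumerate(partitions) if j != i for idx in part]
        folds.append((train, test))
    return folds


k = 10
folds = k_fold_indices(20, k)  # toy data set: 20 examples
test_sets = [set(test) for _, test in folds]
train_sets = [set(train) for train, _ in folds]

# No example appears in more than one test set.
tests_disjoint = all(
    test_sets[i].isdisjoint(test_sets[j])
    for i in range(k)
    for j in range(i + 1, k)
)
# Two training sets share every example that is in neither of their test sets.
shared = len(train_sets[0] & train_sets[1])

print("test sets disjoint:", tests_disjoint)
print("examples shared by training sets 0 and 1:", shared)
```

In this toy setup every example is used for testing exactly once, while each example is used for training in 9 of the 10 folds.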
With linear sampling, each partition contains the examples in the order in which they appear in the original data set. If your data is ordered by label or in any other way, the learner probably does not see a representative sample of the data in each fold, only a certain subset, and thus does not generalize well to the held-out data. You should always use "stratified sampling" on data with a nominal label, or "shuffled sampling" otherwise.
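The effect of ordered data can be shown with a small Python sketch. This is only an illustration under stated assumptions: `linear_folds` and `stratified_folds` are hypothetical helpers written for this post, not RapidMiner's implementation, and the data set (90 examples sorted by three labels, 3 folds) is deliberately extreme.

```python
import random
from collections import Counter, defaultdict


def linear_folds(labels, k):
    """Contiguous partitions in the original order (mimics 'linear sampling')."""
    n = len(labels)
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def stratified_folds(labels, k, seed=0):
    """Shuffle within each label, then deal the indices round-robin so that
    every partition gets roughly the same label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds


# Toy data set, sorted by label (an assumption chosen to exaggerate the effect).
labels = ["a"] * 30 + ["b"] * 30 + ["c"] * 30

linear_counts = [Counter(labels[i] for i in fold) for fold in linear_folds(labels, 3)]
stratified_counts = [Counter(labels[i] for i in fold) for fold in stratified_folds(labels, 3)]

for name, counts in (("linear", linear_counts), ("stratified", stratified_counts)):
    for fold_no, count in enumerate(counts):
        print(name, "test fold", fold_no, dict(count))
```

In this sketch each linear test fold contains a single label that is absent from the corresponding training set, so a learner can never predict it correctly. The stratified test folds each contain all three labels in equal proportion.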
Best, Marius
siamak_want (Jr. Member, Posts: 87)
Re: Overlapping folds in cross validation?
« Reply #2 on: August 31, 2012, 09:31:23 AM »
Thanks for your nice answer, Marius.
So I will always set the sampling type to stratified.
Thanks again, Marius.