Author Topic: Instance Selection and Prototype Rules plugin + Examples  (Read 545 times)
marcin.blachnik
Newbie
*
Posts: 22


« on: November 05, 2011, 11:49:34 PM »

Hello,
The instance selection plugin is now available on the Marketplace. I have also put some examples of how to use it on myExperiment.
Here are the links to these examples:
Example 1 http://www.myexperiment.org/workflows/2537
Example 2 http://www.myexperiment.org/workflows/2538
Example 3 http://www.myexperiment.org/workflows/2539
Example 4 http://www.myexperiment.org/workflows/2540
Example 5 http://www.myexperiment.org/workflows/2541
Example 6 http://www.myexperiment.org/workflows/2542

Regards
Logged
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1196



« Reply #1 on: November 06, 2011, 12:22:08 AM »

Hi Marcin,

Both the extension and the processes on myExperiment work like a charm! Thanks for this great new extension, I am sure many people will appreciate it. Do you have any experience with how well the instance / prototype selection schemes work on larger data sets?

Another question: Does it make much sense to use fewer instances than classes? In examples 4 and 6, only 2 clusters are detected and used, which of course means that only 2 of the 3 classes are predicted. No big deal, but at first I wondered for a sec why that was the case...
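
The effect can be reproduced with a tiny nearest-prototype classifier. This is only a sketch in Python, not the plugin's code: the function name, the prototype coordinates, and the labels attached to them are all made up for illustration. With two labeled prototypes, a third class can never appear among the predictions.

```python
import math

def nearest_prototype_predict(prototypes, proto_labels, point):
    """Return the label of the prototype closest to `point` (Euclidean distance)."""
    idx = min(range(len(prototypes)), key=lambda j: math.dist(point, prototypes[j]))
    return proto_labels[idx]

# Hypothetical prototypes: two cluster centers for a three-class problem.
prototypes = [(1.5, 0.3), (5.0, 1.8)]
proto_labels = ["setosa", "virginica"]

# Hypothetical query points, one from each region of the feature space.
queries = [(1.4, 0.2), (4.5, 1.5), (6.0, 2.2)]
predicted = {nearest_prototype_predict(prototypes, proto_labels, q) for q in queries}

# Only labels carried by a prototype can ever be predicted.
print(predicted)
```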

Anyway, let me thank you again and have a nice weekend,
Ingo
Logged

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
marcin.blachnik
Newbie
*
Posts: 22


« Reply #2 on: November 07, 2011, 02:07:31 PM »

Thank you, Ingo, for the remarks :)
First I did some tests on two-class problems, and then I switched to the Iris dataset, which is a default dataset built into RM, and forgot to adjust the clustering algorithm accordingly.

You asked about applications to large datasets. I haven't analyzed that yet, but I can say that all Instance Selection operators use built-in RM tools and data structures, such as geometriccollection. These operators also have no special memory requirements, because all ExampleSets are views on the data table. Each view is realized as a boolean vector whose length equals the number of samples in the data table. This makes the foreach loop very efficient when iterating over examples, so I think these operators should be applicable to large datasets.
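
As a rough illustration of the view mechanism described above, here is a minimal sketch in Python. It is not RapidMiner's actual Java implementation; `DataTable`, `ExampleSetView`, and `reject` are hypothetical names chosen for this example. The point is that rejecting an instance only flips one boolean, while the underlying rows are never copied.

```python
class DataTable:
    """Holds the actual rows; shared by every view (hypothetical stand-in)."""
    def __init__(self, rows):
        self.rows = rows


class ExampleSetView:
    """A view on a DataTable, realized as one boolean flag per row."""
    def __init__(self, table, mask=None):
        self.table = table
        # One flag per row in the table: True means the row is visible.
        self.mask = [True] * len(table.rows) if mask is None else list(mask)

    def __iter__(self):
        # Iterating skips masked-out rows without copying any data.
        for row, keep in zip(self.table.rows, self.mask):
            if keep:
                yield row

    def __len__(self):
        return sum(self.mask)

    def reject(self, index):
        """Hide a row from this view; the table itself is untouched."""
        self.mask[index] = False


table = DataTable([(1.0, "a"), (2.0, "b"), (3.0, "a")])
view = ExampleSetView(table)
view.reject(1)
for example in view:
    print(example)
```

The extra memory per view in this sketch is one flag per row, regardless of how many attributes each row has.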
Clustering and optimization operators such as the FCM and LVQ algorithms are slightly different: the codebooks or cluster centers are updated continuously, so starting such an operator requires duplicating the initial codebooks (prototypes). I don't believe this should be a problem for big data, though.

Regards
Logged
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1196



« Reply #3 on: November 07, 2011, 06:31:26 PM »

Hi,

Quote
First I did some tests on two-class problems...

ah, that explains it. I thought I might have misunderstood one of the concepts...

Thanks for your explanations about scalability. Last question on the number of iterations: is only a single iteration over all examples necessary for most of the instance selection schemes (without the optimization based schemes of course)?

Cheers,
Ingo
Logged

marcin.blachnik
Newbie
*
Posts: 22


« Reply #4 on: November 09, 2011, 07:38:30 AM »

I'm not sure I understand your question correctly.
The currently implemented instance selection methods all work in different ways. AllKNN, RENN and ENN are very similar, and all are based on the ENN algorithm. This algorithm performs a leave-one-out test on all examples with a kNN classifier, where k is usually set to 3 (the default parameter); if the majority vote of the nearest neighbors predicts an incorrect class label, the instance is marked for rejection from the ExampleSet. This means that all ENN-based algorithms perform something like outlier elimination.
To summarize, ENN is a one-pass algorithm, while AllKNN is an "m"-pass algorithm, because the ENN procedure is performed m = k_stop - k_start times. RENN repeats the ENN algorithm until no further instance is rejected.
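
The three ENN-based schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plugin's code: the function names are hypothetical, distances are Euclidean, and ties in the vote go to the label of the nearer neighbor. For `all_knn` the post only gives the number of passes, so this sketch assumes each pass is evaluated on the full set and the union of flagged instances is removed at the end.

```python
import math
from collections import Counter

def knn_label(data, labels, i, candidates, k):
    """Leave-one-out kNN vote for data[i] among `candidates` (i itself excluded)."""
    neighbors = sorted(
        (j for j in candidates if j != i),
        key=lambda j: math.dist(data[i], data[j]),
    )[:k]
    if not neighbors:
        return labels[i]  # nothing to vote with: keep the instance
    return Counter(labels[j] for j in neighbors).most_common(1)[0][0]

def enn(data, labels, k=3, selected=None):
    """One pass: keep the indices whose kNN vote agrees with their own label."""
    selected = list(range(len(data))) if selected is None else list(selected)
    return [i for i in selected if knn_label(data, labels, i, selected, k) == labels[i]]

def renn(data, labels, k=3):
    """Repeat ENN until a pass rejects no instance."""
    selected = list(range(len(data)))
    while True:
        kept = enn(data, labels, k, selected)
        if len(kept) == len(selected):
            return kept
        selected = kept

def all_knn(data, labels, k_start, k_stop):
    """Run k_stop - k_start ENN passes (assumed: on the full set) and drop the union."""
    everything = list(range(len(data)))
    flagged = set()
    for k in range(k_start, k_stop):
        flagged |= set(everything) - set(enn(data, labels, k, everything))
    return [i for i in everything if i not in flagged]

# Two well-separated groups plus one "b" point sitting inside the "a" group.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
print(enn(data, labels, k=3))
```

On this toy data the mislabeled-looking point at index 8 is the only one rejected, which matches the outlier-elimination reading above.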
The CNN algorithm works differently. It starts from a single instance per class and then iteratively adds new instances to the selected ExampleSet, so that in the worst case all instances can be selected.
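
A minimal Python sketch of this condensation idea follows. It is an illustration, not the plugin's code: the function name `cnn` is hypothetical, and the details not given in the post are assumptions, namely that the seed for each class is its first occurrence, that the rule for adding an instance is a 1-NN misclassification by the current selection, and that passes repeat until nothing is added.

```python
import math

def cnn(data, labels):
    """Condensed selection: returns sorted indices of the selected instances."""
    # Seed with one instance per class (assumed: the first one encountered).
    selected, seen = [], set()
    for i, label in enumerate(labels):
        if label not in seen:
            seen.add(label)
            selected.append(i)

    changed = True
    while changed:
        changed = False
        for i in range(len(data)):
            if i in selected:
                continue
            # Classify instance i with 1-NN over the current selection.
            nearest = min(selected, key=lambda j: math.dist(data[i], data[j]))
            if labels[nearest] != labels[i]:
                selected.append(i)  # misclassified, so it must be kept
                changed = True
    return sorted(selected)

# Well-separated classes: the two seeds are enough.
easy = cnn([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)],
           ["a", "a", "a", "b", "b", "b"])

# Alternating labels on a line: the worst case, every instance is selected.
worst = cnn([(0,), (1,), (2,), (3,)], ["a", "b", "a", "b"])
print(easy, worst)
```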
All these algorithms are rather old (known since the 1970s). In the near future, more state-of-the-art algorithms will be available. My colleagues are now working on algorithms with linear computational complexity, which should scale very well to large datasets.

Best regards
Logged