Pages: [1] 2
  Print  
Author Topic: MicroArray Feature Selection Extension (With RFE!!!)  (Read 3481 times)
Ben
Newbie
*
Posts: 24


WWW
« on: October 14, 2010, 03:50:52 PM »

Hi all,

I have been working on highdimensional microarray data for some time now. As a byproduct I have implemented a handful of feature selection and -weighting operators. These contain amongst others Recursive Feature Elimination (RFE) and minimum Redundancy Maximum Relevance Feature Selection (MRMR) / Correlation based Feature Selection (CFS) (both nearly equivalent). Furthermore I included a wrapper for ensemble feature selection.

As others might find these operators helpful I packed them into a RM5-Extension. You can get it at

https://sourceforge.net/projects/rm-featselext/
or
http://www-ai.cs.tu-dortmund.de/SOFTWARE/MFSPLUGIN/index.html
and
http://www-ai.cs.tu-dortmund.de/auto?expr=Software

And, yeah, it's open source.

If you have any questions or patches  Wink, just contact me.

Ben


P.S.
Before you ask, heres the list of all operators:

Meta-Operators:
  • EnsembleFeatureSelection repeatedly applies the inner feature weighting operator on variations (subsampling/resampling) of the input data and returns the average weights. This gives more robust results.
  • RecursiveFeatureElimination a generic wrapper for performing RFE with any (multivariate) weighting operator inside. Leaves the opportunity to e.g. optimize an SVM's C in each iteration.
  • WindowedWeighting slides a window over the features, applies the inner weighting on each window. Makes most sense to be used in a recursive manner.
Feature selection and -weighting:
  • MRMRFeatureSelection Minimum Redundancy Maximum Relevance Feature Selection by Ding&Peng 2003, same as Correlation Based Feature Selection by Mark Hall, 2000
  • CorrelationBasedWeakAssociations
  • SAMWeighting a univariate test from "Significance Analysis for Microarrays" (Tusher et al. 2001)
  • ExonWelchTest a univariate test for ranking features. This operator was written by student of mine.
  • RecursiveFeatureEliminationSVM RFE with hard-wired linear SVM. A little faster than the wrapper-operator.
  • FeatureQuantileFilter Removes all features which have a p-quantile smaller than t. Helps to filter noise.
  • MRMRFeatureSetEvaluator Evaluate e feature set according to the MRMR-criterion. This allows to combine MRMR with other search strategies, e.g. multi-criteria evolutionary optimization.
  • MaximumRelevanceWeighting ranks feature by mutual information, linear correlation or F-test, dependent on whether feature and label are numerical or categorical.
Classification:
  • LARS "Least Angle Regression" and LASSO (Efron et al. 2004)
  • SAM "Significance Analysis for Microarrays" (Tusher et al. 2001)
Helper:
  • TopK Receives AtrributeWeights and sets the top k entries to 1 and the rest to 0. Useful in Wrapper-Validation. NOT WORKING at the moment due to a bug in RM, sorts in wrong direction, will be fixed either via RM or by me
  • ValueReplenishmentWithOffset Same as the ValueReplenishment Operator in RM but with the ability to add an offset.

« Last Edit: December 08, 2010, 03:15:55 PM by Ben » Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
wessel
Sr. Member
****
Posts: 366


« Reply #1 on: October 14, 2010, 09:49:52 PM »

I read this, but I did not understand what I actually read.

This is feature is a feature subset selection algorithm specially designed for Micro array data?
How is the optimal feature subset defined?
As the feature subset that produces the classifier with minimal error on the test set?

What is wrong with the wrapper approach?
This has more intelligent search?
Logged
Ben
Newbie
*
Posts: 24


WWW
« Reply #2 on: October 15, 2010, 09:05:44 AM »

Hi wessel,

Quote
I read this, but I did not understand what I actually read.
Ok, I'm glad you asked. I will try to clarify.

Quote
This is feature is a feature subset selection algorithm specially designed for Micro array data?
How is the optimal feature subset defined?
As the feature subset that produces the classifier with minimal error on the test set?

The Plugin/Extension does not consist of one single feature selection algorithm or classifier.
It is a small collection of feature selection methods that were published with a focus on microarray data and which were not available in RM (at that time(?)). To try them out on my data, I had to implement them in RM. The list of implemented methods is shown in my previous post. I'll edit it and add some explanation to the operators.


Quote
What is wrong with the wrapper approach?
This has more intelligent search?

Microarray data usually has LOTS of feature, 1000s to 100.000s. My particular dataset has >280.000 features. Conducting a Forward Selection with a wrapper and a 10fold cross-validation with, e.g. an SVM for training, for selecting only 10 features you would have to train roughly ~ 10 * 280.000 * 10 =  28.000.000 SVMs. Don't even try to think of backward-elimination as the models have to be built on very highdimensional data.
Yes, you could reduce from 10fold to e.g. 5fold and use NaiveBayes instead of SVM, but it would still take ages. And with the old* RM forward selection you would run out of memory.
*I know the new forward selection improved on this, but I haven't tried yet.

So that's the computational issue. Also, wrapper searches with a a certain learner may overfit to that specific learner.
In microarray literature it mostly boils down to using a statistical test to rank the features and then select the k most important features. But by the time more sophisticated methods have been developed that take feature interactions into account. Examples for such are RFE, MRMR, CFS, CBFS, FCBF. Some of them define an optimal feature set, some (most) just rank the features.

Hope I could shed some light on the topic.
Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
wessel
Sr. Member
****
Posts: 366


« Reply #3 on: October 15, 2010, 10:53:35 PM »

Thanks for the explanation.

Yes, the wrapper approach certainly isn't always feasible.

I have used the WEKA CFS implementation a lot.
Seems to work quite well.

What I recall from my last data mining course, is that when dealing with high dimensional data,
for which the number of instances (rows) is less then the number of attributes (columns),
the killer approach to dimensionality reduction is clustering.

You take the transpose of the data, apply clustering, and it will cluster attributes together.

Another approach that works quite well, is recurrent neural networks,
which is some generalistion of the Boltzmann machine.
The idea is that the input to the network must be the same as the output of the network,
and the hidden nodes represent the reduced attribute space.
This often takes a while to compute, but memory wise it is not a problem.

Best regards,

Wessel
 
Logged
Ben
Newbie
*
Posts: 24


WWW
« Reply #4 on: October 26, 2010, 03:42:33 PM »

I just found out, that the TopK-Operator isn't working properly due to a bug/misunderstanding in RapidMiner.
The sorting directions got mixed up. So at the moment the bottom k are returned Sad
I've filed this as bug 437 http://bugs.rapid-i.com/show_bug.cgi?id=437. Either I or Rapid-I will fix this.

I fixed this. See below.
« Last Edit: November 05, 2010, 02:01:40 PM by Ben » Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Ben
Newbie
*
Posts: 24


WWW
« Reply #5 on: October 26, 2010, 04:16:46 PM »

@Wessel: Thanks for the useful comments. I haven't tried Recurrent neural networks yet, but they sure sound promising. As far as I see they not implemented in RapidMiner!?! Which implementation are you using?

Funny thing about clustering: I started such a clustering process on my data 131 examples ~280.000 features the day before I published my plugin and it is still running. Given that runtime the results must be awesome Smiley

The last time I did this the results weren't too bad, but did not justify the runtime.

Bye,
Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
wessel
Sr. Member
****
Posts: 366


« Reply #6 on: October 26, 2010, 11:21:52 PM »

Hey,

Which clustering algorithm are you using?
I never realised that some clustering algorithms do not scale up very well, but ye, some off scale really bad! :-/
I wonder if you could use an association rule learning instead of a clustering algorithm to group together attributes.
www.hku.hk/dupad/asiagis/fall03/Full_Paper/Li_Pingxiang.pdf
This algorithm is designed to scale up very well, but it does not deal with continuous data very well.

I am using the Neural Network Toolbox from MATLAB.
It's mandatory for me to use at University :-/, but I would not recommend anyone to use it.
True, rapid miner does not support recurrent neural networks, because it can only have a single output.
When I find a work around for this problem I'll let you know.

I have implemented my own Neural Network in Java before.
The feed forward part is only about 20 lines of code.
Since this data is so enormously big, it is probably beneficial to compile your own code.
What do you want this implementation to be good at?

   public double neuralnetwork(double[] input) {
      double[] hidden = new double[HIDDENS];
      for (int h = 0; h < HIDDENS; h++) {
         hidden[h] = getWeight(h, INPUTS); // Threshold weight
         for (int i = 0; i < INPUTS; i++) {
            hidden[h] += input * getWeight(h, i);
         }
      }
      for (int h = 0; h < HIDDENS; h++) {
         hidden[h] = Math.tanh(hidden[h]); // squash between -1 and 1
      }
      double o = getWeight(HIDDENS); // Threshold weight
      for (int h = 0; h < HIDDENS; h++) {
         o += hidden[h] * getWeight(h);
      }
      o = Math.tanh(o); // squash between -1 and 1
      return o;
   }


Logged
Ben
Newbie
*
Posts: 24


WWW
« Reply #7 on: November 05, 2010, 12:54:35 PM »

To keep the MFS-plugin working, I applied a workaround for the sorting direction problem in the TopK-operator until bug 437 http://bugs.rapid-i.com/show_bug.cgi?id=437 is fixed

The new version 1.0.1 is available at http://www-ai.cs.tu-dortmund.de/SOFTWARE/MFSPLUGIN/index.html
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
ericflores
Newbie
*
Posts: 1


« Reply #8 on: November 05, 2010, 03:34:59 PM »

Thanks a lot Ben for this extension and the quick patch.
Eric
Logged
Ben
Newbie
*
Posts: 24


WWW
« Reply #9 on: December 08, 2010, 03:09:45 PM »

Hi,

the MicroArray Feature Selection Extension has changed its name and location and features (it's growing)

The latest release 1.0.2 of my Feature Selection Extension can be found at

https://sourceforge.net/projects/rm-featselext/


By migrating to SourceForge the source code is freely available via SVN. So if you have patches/additional features, just pass them over.
Futhermore I added a new feature:

Least Angle Regression (LARS) which is capable of computing LASSO estimates for classification and regression.

Enjoy,
Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Ingo Mierswa
Administrator
Hero Member
*****
Posts: 1196



WWW
« Reply #10 on: December 09, 2010, 12:10:38 PM »

Great news, thank you Ben!

Cheers,
Ingo
Logged

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
Ben
Newbie
*
Posts: 24


WWW
« Reply #11 on: December 13, 2010, 03:16:49 PM »

Hi,

Version 1.0.3 was released at https://sourceforge.net/projects/rm-featselext/

It contains the following changes:
 - Renamed and rearranged operators in RM (I'm trying to adhere to the RapidMiner naming conventions)
 - Changed the caching strategy of the MRMRCriteria.
   Now the user needs to create a MRMRCache-object in RM if he wants to use caching.
   There is no need for clearing the cache by hand.
   -> The MRMRCacheResetter was replaced by a MRMRCacheCreator
 - Added some operator description
 - Dev only: Rearranged operators resource files

BTW, thanks for your feedback

Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Ben
Newbie
*
Posts: 24


WWW
« Reply #12 on: January 03, 2011, 11:48:55 AM »

Version 1.0.4 is out:
  - RFE: added parameter: use_absolute_weights
  - FIXED: Computation of F-test lacked ^2 in enumerator term. Affected all MRMR-operators
  - Added Kuncheva's consistency index for FS stability evaluation
  - Removed deprecated genetic MRMR-FS
  - Added MakePerformanceLoggable-operator

Get it at https://sourceforge.net/projects/rm-featselext/

Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Ben
Newbie
*
Posts: 24


WWW
« Reply #13 on: March 08, 2011, 11:04:04 AM »

Version 1.0.5 is out:

 - Ensemble MRMR can combine rankings
     -> rankings do not have to be calc'd for all features 
     -> Combine Ranks, the first 2k features are ranked, when combining ranks, non ranked features receive a penatly rank of 3k
 - Performance-Logger can now log StdDev, variance and number of entries and can now identiy a single criterion by id or name.
 - Top-K-operator can now also return bottom-k and make use of absolute weights
 - LARS - threshold !=0 now delivers an exact solution with |beta|=t and not the best |beta| <= t
 - BUG FIX in RFE: Attributes were removed even when "remove attributes" was not active
 
Get it at https://sourceforge.net/projects/rm-featselext/

Ben
Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Ben
Newbie
*
Posts: 24


WWW
« Reply #14 on: March 14, 2011, 03:56:09 PM »

SEVERE BUGFIX

Hi all,
as RapidMiner Bug 437 (http://bugs.rapid-i.com/show_bug.cgi?id=437) has been corrected, the sorting direction for AttributeWeights.sortByWeight(...) have changed. This rendered my plugin useless with all new version (I guess >= 5.1.x). I adjusted the code to meet these changes. Sadly, this eliminates downward compatibility. So switch to the latest RM version.  (Now needs RapidMiner 5.1.003 or later!)
This affects: EnsembleFeatureSelection, ExonWelchTest, RFE, RFE-SVM, Selection2Ranking, TopK, Weights2Ranking and MRMR-FS.

If you use the plugin, please upgrade RM and the plugin to the latest version.
As usual, you can download the latest version  (1.0.6) from https://sourceforge.net/projects/rm-featselext/.

Ben

Logged

http://www-ai.cs.tu-dortmund.de/PERSONAL/schowe.html
Artificial Intelligence Group
Technical University of Dortmund
Pages: [1] 2
  Print  
 
Jump to: