I had the pleasure of taking the RapidMiner training course with Ralf in New York last week. It was very much worthwhile, and I learned a lot about how to use RapidMiner more effectively. I also came across several places where I thought RM could be improved, or where it might not be working as intended. Some of these might already be in the plans for RM 4.3 or 5.0, but I'm going to list everything I noticed that I would definitely be interested in seeing.
1) Option to keep each input to a node -- right now some nodes have a "keep" option which allows you to preserve one of the inputs, such as an example set. However, this seems to be inconsistently offered across different nodes, and in some cases only some of the inputs can be preserved. While this can be worked around with the IOMultiplier, it would be nice if you had the option to keep any of the otherwise-consumed inputs on each node.
2) Improve representation of constants in mathematical expressions. Having to use "const()" instead of "1" is a pain to remember.
3) In a process chain, have an indicator that shows whether an operator has a comment or not. Right now there is no way to tell from the GUI which nodes have comments without inspecting each one.
4) I would like to see more formatting options for graphs, such as being able to select the font, point size, number format, etc.
5) Rearrangeable/dockable UI components. This is admittedly a bigger project, but I would like to be able to rearrange the components of the GUI more. Particularly with multiple-monitor setups, it would be nice to have the flexibility to move seldom-referenced components (like the memory monitor) off the main working area while keeping them visible at a glance. Also, I'd like the option to see both the process chain and the results at the same time; right now I have to choose one view or the other.
6) I've mentioned this in other threads on the forum, but I use Evolutionary Weighting a lot, and I'd like to be able to:
(a) start Evolutionary Weighting using an initial weight vector (possibly from a previous run, or from a simpler model to get a good starting point),
(b) be able to manually pause an Evolutionary Weighting node and get the current values for the weights (or write them to a file)
(c) be able to extract weights from Evo Wgt periodically during a run, so I have a "best known" set of values if I have to terminate the model run, or if the system crashes.
These ideas work together. In the case of (b) or (c), the saved values of the weights can be passed to the model as initial values for the next model run if (a) is implemented.
7) Have a random selection learner, which for classification problems guesses one of the label values, in the proportion they were present in the training set. This is a "base case" simple learner, so you can compare other learners to see how much better they are than random guessing.
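A sketch of what such a baseline learner would compute, in Python (the class and method names here are hypothetical, not RapidMiner code):

```python
import random
from collections import Counter

class RandomBaseline:
    """Guesses a label at random, in the proportions seen during training."""

    def fit(self, labels):
        counts = Counter(labels)
        total = sum(counts.values())
        self.values = list(counts)
        self.weights = [counts[v] / total for v in self.values]
        return self

    def predict(self, n, seed=None):
        rng = random.Random(seed)
        return [rng.choices(self.values, weights=self.weights)[0]
                for _ in range(n)]

# Train on a 70/30 label split; predictions follow that proportion on average.
model = RandomBaseline().fit(["yes"] * 70 + ["no"] * 30)
preds = model.predict(1000, seed=42)
```

Comparing a real learner's performance against this baseline shows how much it actually learned beyond the raw label distribution.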
8) Add the ability to do 'stratified' sampling across continuous numerical labels. What I'm looking for is something that will guarantee that I have representatives in the sample from the rare portions of the label's distribution. In my case, I'm most interested in predicting values that are at the high end of the distribution, which occur infrequently, but are of the most value. I want to make sure that they are included when I take a sample of the data to do training on.
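One way to get this behavior (sketched in Python; the function and its parameters are my own invention, not an existing operator) is to sort by the label, cut it into quantile bins, and sample at least one example from each bin, so the rare high-end values are guaranteed representation:

```python
import random

def stratified_sample(examples, label, n_bins=5, frac=0.2, seed=0):
    """Sample a fraction of each quantile bin of a continuous label."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=label)
    bin_size = max(1, len(ordered) // n_bins)
    sample = []
    for i in range(0, len(ordered), bin_size):
        bin_ = ordered[i:i + bin_size]
        k = max(1, round(frac * len(bin_)))  # at least one example per bin
        sample.extend(rng.sample(bin_, k))
    return sample

data = list(range(100))  # label values 0..99; the high end is "rare" per bin
s = stratified_sample(data, label=lambda x: x, n_bins=5, frac=0.1)
```

With 5 bins and frac=0.1, each bin of 20 contributes 2 examples, so the top bin (values 80-99) is always represented in the 10-example sample.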
9) Allow objects to be named and referenced with IOMultiplier, IOSelector, etc, rather than just by index number, which is hard to keep track of.
10) Model Merge -- be able to combine two models, such as a preprocessing model and a learner model, into a single model object that can be applied later on, rather than having to maintain two model objects.
11) FeatureIterator does not appear to deliver its results (i.e. inside FeatureIterator, if you build a separate model for each feature, those models aren't returned as output from the iterator).
12) In ParameterIteration, allow sets of scenarios to be specified to iterate over, where each scenario contains a list of parameter values to set for that scenario,
e.g. a=1, stat=hr; a=2, stat=ubb, etc. This also helps with meta-control, such as setting output file names.
e.g. I have three labels I want to predict, using MultipleLabelIterator. They are renamed label_1, label_2, label_3, but I want to write the output files with the original attribute names.
Scenario 1: att=1, attname="revenue"
Scenario 2: att=2, attname="cost"
Scenario 3: att=3, attname="profit"
Or: I have a process chain that is going to test a KNN model and an SVM model, each with 3 different parameter values. I need to specify different parameters for each one, and each model's parameters don't make sense for the other. Thus, I want to test (with model #1 = KNN, model #2 = SVM):
Scenario 1: modelnum=1, k=1
Scenario 2: modelnum=1, k=5
Scenario 3: modelnum=1, k=10
Scenario 4: modelnum=2, C=1.0
Scenario 5: modelnum=2, C=2.0
Scenario 6: modelnum=2, C=5.0
This approach could also be used in grid parameter optimization. Right now it does all combinations of the parameters, even if they don't make logical sense (e.g. testing "modelnum=1, k=1, C=1.0" and "modelnum=1, k=1, C=2.0" in the example above).
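To make the idea concrete, here is a Python sketch of scenario iteration versus a full grid (the run_process stand-in is hypothetical; in RapidMiner this would be the operator chain being parameterized):

```python
# Explicit scenarios: only the parameter combinations that make sense.
scenarios = [
    {"modelnum": 1, "k": 1},
    {"modelnum": 1, "k": 5},
    {"modelnum": 1, "k": 10},
    {"modelnum": 2, "C": 1.0},
    {"modelnum": 2, "C": 2.0},
    {"modelnum": 2, "C": 5.0},
]

def run_process(params):
    # Stand-in for applying one parameter set to the process chain.
    return f"model {params['modelnum']} with " + ", ".join(
        f"{name}={value}"
        for name, value in params.items() if name != "modelnum")

# 6 runs, instead of the 2 x 3 x 3 = 18 a full grid over
# modelnum, k, and C would produce.
results = [run_process(s) for s in scenarios]
```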
13) Have an option to explicitly list values of default options in XML file, rather than only printing values that differ from defaults. This makes it easier to modify later, as well as making it more self-documenting. The current approach could still be offered as an option to minimize the size of the XML files.
14) It would be nice to have a way to capture the underlying data from the ROC plots as a CSV. Right now you can only view the plots; you can't actually get the computed data used to build them, other than in RM's XML format if you save the object.
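For reference, the underlying data is just a list of (false-positive-rate, true-positive-rate) points swept over the score threshold; a Python sketch of what such a CSV export would contain:

```python
def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs by sweeping the decision threshold
    down through the sorted confidence scores."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Two positives and two negatives, partially separated by score.
pts = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Writing `pts` out with the csv module would give exactly the table the plot is drawn from.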
15) The correlation matrix should allow you to create both the matrix and attribute weights (not either/or, which is the current behavior)
16) In ProcessLog, show what performance measure is actually being tracked, rather than the generic "Performance" title. It can be hard to remember which measure was selected, and which direction is "good" without inspecting the process.
17) I think there's a bug in ExampleSetWriter -- I have a dataset (read from a database) where if I write using ExampleSetWriter, it fails when I try to read it in using ExampleSource. I think the problem is that there are nominal values in my data that are just whitespace, and from inspecting the output file, it looks like such values don't get quoted when written. I can work around this by using CSVExampleWriter instead, but it seems like ExampleSource should be able to read whatever ExampleSetWriter saves.
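A small Python sketch of the suspected failure mode, assuming ExampleSetWriter uses whitespace-separated fields (I'm inferring the format from the symptom, not from the source):

```python
import shlex

# A nominal value that is just whitespace, between two normal fields.
row = ["id1", "   ", "other"]

# Written unquoted into a whitespace-separated line, the field vanishes:
unquoted = " ".join(row)
assert unquoted.split() == ["id1", "other"]  # only two fields survive

# Quoting whitespace-only values preserves all three fields for any
# reader that honors quotes (shlex stands in for such a reader here):
quoted = " ".join(f'"{v}"' if v.strip() == "" else v for v in row)
assert shlex.split(quoted) == row
```

This matches the observed behavior: CSVExampleWriter quotes its fields, so the round trip works there but not with ExampleSetWriter.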
18) In much the same way that you allow the export of graphics to GNUPlot, I'd like to be able to write them in a way that could be manipulated with R, which is the other main open-source statistics package I use. I could then ensure that all the charts that I produce have a consistent look and feel to them.
RapidMiner is a great product, and I feel like I have a much better grasp on its capabilities having taken the course. Hopefully, some of these suggestions will be useful in helping to guide the future development and make it even better.