As many of you may know by now, we are going to release a new major version of RapidMiner soon. Version 5.0, codename Vega, will come with a whole bunch of new features, some of which I am going to present in more detail on this blog over the coming days and weeks.
The first feature I'm presenting here is the new flow layout which Vega will have in addition to the well-known Tree. (I'm not presenting a screenshot, since an early version has been posted here.) For us, the Tree has for many years been a powerful companion, representing data mining processes in a neat and compact way. While processes the size of a cross validation on Iris could be taken in at a glance, the non-expert was easily lost in the implicit data flows defined by the Tree. I would not be surprised if one or another experienced RapidMiner trouper was unaware of how this data flow was actually defined, so let's recap for a second.
In earlier RapidMiner versions, at any stage during the execution of a process, all data currently available is kept on a stack. An operator being executed grabs hold of the first object of the desired type (example set, model, etc.) and pushes its results back on top of the stack. Thus, if several example sets have been generated, an operator consuming an example set will always use the one generated by the most recently executed operator. Wherever this behaviour did not suit the needs of the process designer, operators had to be used to rearrange the ordering of the stack: IOSelector (bringing an object to the front of the stack), IOConsumer (deleting useless objects), IOMultiplier (making copies of objects), and combinations of IOStorer and IORetriever (sharing objects across distant parts of the process). In complicated processes, these operators can easily clutter up the process, making it hard to follow what is going on.
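To make the stack behaviour concrete, here is a minimal Python sketch of it. This is not RapidMiner code: the names `IOObject`, `take_first` and `run_operator`, as well as the operator names `LoadA`, `LoadB` and `Learner`, are made up for illustration, and the sketch assumes that an operator removes the objects it uses from the stack.

```python
from dataclasses import dataclass


@dataclass
class IOObject:
    kind: str   # e.g. "ExampleSet" or "Model"
    label: str  # name of the operator that produced it (illustration only)


def take_first(stack, kind):
    """Remove and return the topmost object of the requested kind."""
    # The top of the stack is the end of the list, so search backwards.
    for i in range(len(stack) - 1, -1, -1):
        if stack[i].kind == kind:
            return stack.pop(i)
    raise LookupError(f"no {kind} on the stack")


def run_operator(stack, name, consumes, produces):
    """Take one object per requested kind, then push the results on top."""
    inputs = [take_first(stack, kind) for kind in consumes]
    for kind in produces:
        stack.append(IOObject(kind, name))
    return inputs


stack = []
run_operator(stack, "LoadA", [], ["ExampleSet"])
run_operator(stack, "LoadB", [], ["ExampleSet"])
used = run_operator(stack, "Learner", ["ExampleSet"], ["Model"])

print(used[0].label)             # LoadB
print([o.label for o in stack])  # ['LoadA', 'Learner']
```

The learner silently picks up the example set from `LoadB` because it was generated last. Getting it to use the one from `LoadA` instead is exactly the kind of situation that called for a stack rearrangement operator.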
In Vega we will have a flow layout view which makes this data flow explicit: operators will have input ports and output ports which can be connected to each other in an arbitrary fashion (as long as one doesn't construct cycles). Does this mean we have to say goodbye to the good old Tree? No. One of our design goals was to preserve the option of using the implicit data flow defined by the Tree, allowing for rapid process design wherever this is possible. At the same time, we provide the option of using the explicit flow layout whenever the Tree would require fumbling with stack rearrangement operators.
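The "no cycles" rule can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions, not the Vega implementation: the `Flow` class, its `connect` method and the port names are hypothetical, and the sketch assumes each output port feeds at most one input port.

```python
from collections import defaultdict


class Flow:
    """Hypothetical container for operators and port-to-port connections."""

    def __init__(self):
        # successors[op] = operators fed by one of op's output ports
        self.successors = defaultdict(set)
        # (source operator, output port) -> (target operator, input port)
        self.connections = {}

    def _reaches(self, start, goal):
        """Depth-first search: can goal be reached from start?"""
        todo, seen = [start], set()
        while todo:
            op = todo.pop()
            if op == goal:
                return True
            if op not in seen:
                seen.add(op)
                todo.extend(self.successors[op])
        return False

    def connect(self, src, out_port, dst, in_port):
        # A new edge src -> dst closes a cycle exactly when src is
        # already reachable from dst (this also rejects src == dst).
        if self._reaches(dst, src):
            raise ValueError(f"connecting {src} -> {dst} would create a cycle")
        self.connections[(src, out_port)] = (dst, in_port)
        self.successors[src].add(dst)


flow = Flow()
flow.connect("Load", "output", "Normalize", "example set")
flow.connect("Normalize", "example set", "Learner", "training set")
try:
    flow.connect("Learner", "model", "Normalize", "example set")
except ValueError as err:
    print(err)  # connecting Learner -> Normalize would create a cycle
```

Checking reachability before adding each connection keeps the graph acyclic at all times, so the operators can always be put into a valid execution order.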
The default behaviour of RapidMiner 5.0 will be the following: whenever an operator is added to the process, the data flow will be constructed so as to simulate the one that would have been implicit in pre-5.0 versions. Also, in case one ever gets confused, all connections can be deleted and restored in the old-fashioned way with the click of a button. This behaviour can be customized in various ways. From our experience so far, we can say that the new flow layout significantly eases process design: inserting operators at the beginning will not mess up the objects at later stages, and at every stage it is perfectly clear which object ends up where. This is particularly true in combination with the new meta data transformation, which I am going to present in one of the next posts.
For us, one of the challenges that come with this new feature is importing old processes into RapidMiner 5.0. In these process files, the data flow is not explicitly stated, and it cannot easily be simulated since it may, for a few operators, depend on the actual input data and exist only at execution time. We therefore take the following approach: First, the operators are loaded and left unconnected. Then, a data flow is automatically constructed as described above. Here, the stack rearrangement operators (IOConsumer and IOSelector) are taken into account for the construction of the flow, but they are not connected themselves and are deleted afterwards. This imports a large fraction of the existing processes correctly. One problem left open is the small set of operators for which the number of outputs generated depends on the input data, e.g. iterating operator chains and the ValueIterator. For these, we introduce a new collection I/O-Object type. Importing processes that use these operators may be problematic, but such processes are easy to fix manually.
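The import approach can be sketched roughly as follows in Python. This is a heavily simplified illustration, not the actual importer: it handles only a flat list of operators, the `Op` record and its `which` parameter (the n-th object of a kind, counted from the top of the stack) are assumptions of mine, and operators with a data-dependent number of outputs are ignored entirely.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    consumes: tuple = ()
    produces: tuple = ()
    which: int = 1  # assumed parameter of IOSelector/IOConsumer


REARRANGERS = {"IOSelector", "IOConsumer"}


def find(stack, kind, which=1):
    """Index of the which-th object of this kind, counted from the top."""
    seen = 0
    for i in range(len(stack) - 1, -1, -1):
        if stack[i][1] == kind:
            seen += 1
            if seen == which:
                return i
    return None


def import_process(ops):
    """Simulate the old stack symbolically and derive explicit connections."""
    stack, kept, connections = [], [], []  # stack entries: (producer, kind)
    for op in ops:
        if op.name in REARRANGERS:
            # Apply the effect on the simulated stack, then drop the operator.
            i = find(stack, op.consumes[0], op.which)
            if i is not None:
                entry = stack.pop(i)
                if op.name == "IOSelector":
                    stack.append(entry)  # bring the object to the front
            continue
        kept.append(op.name)
        for kind in op.consumes:
            i = find(stack, kind)
            if i is not None:
                producer, _ = stack.pop(i)
                connections.append((producer, kind, op.name))
        for kind in op.produces:
            stack.append((op.name, kind))
    return kept, connections


ops = [
    Op("LoadA", produces=("ExampleSet",)),
    Op("LoadB", produces=("ExampleSet",)),
    Op("IOSelector", consumes=("ExampleSet",), which=2),
    Op("Learner", consumes=("ExampleSet",), produces=("Model",)),
    Op("Applier", consumes=("Model", "ExampleSet"), produces=("ExampleSet",)),
]
kept, connections = import_process(ops)
print(kept)
# ['LoadA', 'LoadB', 'Learner', 'Applier']
print(connections)
# [('LoadA', 'ExampleSet', 'Learner'), ('Learner', 'Model', 'Applier'),
#  ('LoadB', 'ExampleSet', 'Applier')]
```

In this example the IOSelector only influences which connection is drawn (the learner is wired to `LoadA` rather than `LoadB`) and then disappears from the imported process.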
In fact, the import of processes from older RapidMiner versions is one of the main reasons why we will have an alpha test phase starting in mid-August. We seek the support of the community for feedback on which of your old processes are imported correctly, and which aren't. You can register for the alpha test here:
Any feedback is greatly appreciated. If you ever wanted to contribute to how RapidMiner is evolving, this is the time to do so.