Rapid-I Blog
Tag >> flow layout
Tags: Vega, RapidMiner, flow layout. 26 Aug 2009
Approaching Vega (Episode III: Flow vs. Tree) by Ingo Mierswa

Today I loaded an old process I once designed as an example for one of our customers. The process is not too complicated and consists of only a few operators. In order to test the import mechanism of the alpha version of Vega, I first loaded the process in RapidMiner 4.5 and checked the process setup and the results. Here is what the process looks like as an operator tree (the image was taken from RapidMiner 5):

This process seems to be pretty linear, right? Of course not, as all experienced RapidMiner users will notice at once. The process setup as a tree only looks quite linear, but the internal result stack (see the entry Simon posted a few days ago) and the two IO multipliers make things a bit more complicated.

The next thing I did was to import this process into RapidMiner 5 and have a look at it in the new flow view. Here is the result:

I only rearranged the locations of some operators and exported the picture above. After 8 years of being a hardliner in defending the operator tree + result stack idea for process design, I got the feeling (again ;-) that this flow layout with its explicit data flows might be much easier to understand. This is probably especially true for non-computer-scientists who are not used to concepts like stacks and trees.

Same process, same results. Although I still like the tree, and sometimes (as Simon has pointed out) it is still necessary in order to define the order of independent subprocesses, I am really impressed by the importing capabilities of RapidMiner 5 and the nice look of the graph. I hope that this makes process design much easier, especially for less experienced users.

And what about efficiency in process design? How does the flow layout compare to the tree in this respect? Well, here the meta data transformation Simon has described is a big help. Unless you turn this feature off, all new operators are automatically wired according to fitting meta data descriptions of the connection ports. So in most cases, you still only have to drag the operator to the right position and RapidMiner makes the connection itself. The effort is therefore about the same as for the tree.
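As a rough illustration of what such automatic wiring could look like, here is a minimal Python sketch. It is not RapidMiner code: the function `auto_wire`, the operator names and the reduction of "fitting meta data" to a simple type name are all assumptions made for this example.

```python
def auto_wire(open_outputs, new_operator, input_types, output_types):
    """Connect each input of a new operator to the most recent fitting output.

    open_outputs: list of (operator, type) pairs not yet consumed, newest last.
    Returns the connections made as (source_operator, type, new_operator).
    """
    connections = []
    for wanted in input_types:
        # Search from the newest open output backwards for a fitting type.
        for i in range(len(open_outputs) - 1, -1, -1):
            source, offered = open_outputs[i]
            if offered == wanted:
                connections.append((source, wanted, new_operator))
                del open_outputs[i]
                break
    # The new operator's own outputs become available to later operators.
    open_outputs.extend((new_operator, t) for t in output_types)
    return connections


open_outputs = []
auto_wire(open_outputs, "Retrieve", [], ["ExampleSet"])
wired = auto_wire(open_outputs, "Learner", ["ExampleSet"], ["Model", "ExampleSet"])
print(wired)
```

An input with no fitting open output is simply left unconnected in this sketch, which corresponds to the case where the user still has to draw the connection by hand.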

Clear design, explicit flows, same effort. It looks to me as if the new flow design will turn out to be the winner of the challenge "flow vs. tree".

Tags: meta data, flow layout. 14 Aug 2009
Approaching Vega (Episode II: Meta Data) by Simon Fischer

Those of you who have been using RapidMiner for some time have probably come across the "Validate process" button. Pressing this button triggers some sanity checks and a dry run of the process, which passes dummy objects around to see whether all operators receive the correct input. While this was a helpful feature for catching gross errors in the process setup, it is but a fraction of what RapidMiner 5.0 will offer.

From RapidMiner 5.0 on, the process validation will be much more powerful and detailed. First off, it is no longer necessary to jump to a breakpoint to see what data will arrive at a certain operator. Since we have an explicit data flow, we can easily check where the data assigned to the input of an operator comes from, and which operators it went through up to here.

More importantly, operators also provide much more detailed information about pre- and postconditions. Learners specify what kind of input data and label type they can handle (nominal vs. numeric), whether they can deal with missing values, etc. Preprocessing operators know how they transform the data and annotate their results accordingly.

For example, one common pitfall was to create a process for regression learning containing an SVM, but to forget to set the kernel type to a regression kernel.

 

The above screenshot shows what happens in Vega. The data is loaded from a repository, where it is stored together with meta information about attribute types, statistics, etc. The Normalization operator transforms this meta information: The range of the selected attribute is changed to the interval [0,1]. Finally, the SVM checks its input and reports that it cannot handle a numerical label for the C-SVC kernel type.

All this information is updated on a click. As soon as the kernel type is changed, or the problem is solved in a different way, e.g. by discretizing the data, the error vanishes. All these repair options are offered to the user as quick fixes.
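The idea of transforming and checking meta data without touching the data itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classes `AttributeMeta` and `ExampleSetMeta`, the functions `normalize_meta` and `check_svm`, and the capability table (including the `epsilon-SVR` entry) are hypothetical and not taken from RapidMiner.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class AttributeMeta:
    name: str
    value_type: str                                   # "numeric" or "nominal"
    value_range: Optional[Tuple[float, float]] = None  # known range, if any


@dataclass(frozen=True)
class ExampleSetMeta:
    attributes: Tuple[AttributeMeta, ...]
    label: AttributeMeta


def normalize_meta(meta, attribute_name):
    """Meta data effect of a normalization step: the range becomes [0, 1]."""
    attributes = tuple(
        replace(a, value_range=(0.0, 1.0)) if a.name == attribute_name else a
        for a in meta.attributes
    )
    return replace(meta, attributes=attributes)


# Assumed capability table: label types accepted per kernel type setting.
SVM_LABEL_TYPES = {"C-SVC": {"nominal"}, "epsilon-SVR": {"numeric"}}


def check_svm(meta, kernel_type):
    """Return a list of problems, found from the meta data alone."""
    if meta.label.value_type not in SVM_LABEL_TYPES[kernel_type]:
        return [f"{kernel_type} cannot handle a {meta.label.value_type} label"]
    return []


meta = ExampleSetMeta(
    attributes=(AttributeMeta("a1", "numeric", (-5.0, 20.0)),),
    label=AttributeMeta("label", "numeric", (0.0, 100.0)),
)
normalized = normalize_meta(meta, "a1")
print(check_svm(normalized, "C-SVC"))        # one problem reported
print(check_svm(normalized, "epsilon-SVR"))  # no problems
```

Changing the setting and re-running the check makes the reported problem disappear, which is the behaviour a quick fix would automate.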

Tags: Vega, RapidMiner, flow layout, alpha. 31 Jul 2009
Approaching Vega (Episode I: The flow layout) by Simon Fischer

As many of you may know by now, we are going to release a new major version of RapidMiner soon. Version 5.0, codename Vega, will come with a whole bunch of new features, some of which I am going to present in more detail in this blog over the coming days and weeks.

The first feature I'm presenting here is the new flow layout, which Vega will have in addition to the well-known Tree. (I'm not presenting a screenshot; an early version has been posted here.) For us, the Tree has for many years been a powerful companion, representing data mining processes in a neat and compact way. But while processes the size of a cross validation on Iris can be taken in at a glance, the non-expert was easily lost in the implicit data flows defined by the Tree. I would not be surprised if one or another experienced RapidMiner trouper was unaware of how this data flow was actually defined, so let's recap for a second.

In earlier RapidMiner versions, at any stage during the execution of a process, all data currently available is kept on a stack. An operator being executed grabs hold of the first object of the desired type (example set, model, etc.) and pushes its results back on top of the stack. Thus, if several example sets have been generated, the operator consuming an example set will always use the one generated by the most recently executed operator. Wherever this behaviour did not suit the needs of the process designer, operators had to be used to rearrange the ordering of the stack: IOSelector (bringing an object to the front of the stack), IOConsumer (deleting useless objects), IOMultiplier (making copies of objects), and combinations of IOStorer and IORetriever (sharing objects across far away places of the process). In complicated processes, this can easily clutter up the process, making it hard to follow what is going on.
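To make the stack behaviour concrete, here is a small Python simulation. It is a sketch only, not RapidMiner code: the names `IOObject`, `take_first`, `run_operator` and the example operators are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class IOObject:
    kind: str    # e.g. "ExampleSet" or "Model"
    label: str   # free-text description, used for tracing only


def take_first(stack, kind):
    """Remove and return the topmost object of the requested kind."""
    for i, obj in enumerate(stack):   # index 0 is the top of the stack
        if obj.kind == kind:
            return stack.pop(i)
    raise LookupError(f"no {kind} on the stack")


def run_operator(stack, name, consumes, produces):
    """Consume one object per requested kind, then push results on top."""
    inputs = [take_first(stack, kind) for kind in consumes]
    used = ", ".join(obj.label for obj in inputs) or "nothing"
    for kind in produces:
        stack.insert(0, IOObject(kind, f"{kind} from {name}"))
    return used


def demo():
    stack, trace = [], []
    run_operator(stack, "Source A", [], ["ExampleSet"])
    run_operator(stack, "Source B", [], ["ExampleSet"])
    # The learner silently picks the most recent example set (Source B).
    trace.append(run_operator(stack, "Learner", ["ExampleSet"], ["Model"]))
    # The applier needs a model and an example set; only A's set is left.
    trace.append(
        run_operator(stack, "Applier", ["Model", "ExampleSet"], ["ExampleSet"])
    )
    return trace


for line in demo():
    print(line)
```

Nothing in the operator list itself says that the learner trains on the data from Source B and the applier works on the data from Source A. That information exists only in the order of execution, which is exactly the implicitness described above.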

In Vega we will have a flow layout view which makes this data flow explicit: operators will have input ports and output ports which can be connected to each other in an arbitrary fashion (as long as one doesn't construct cycles). Does this mean we have to say goodbye to the good old Tree? No. One of our design goals was to keep the possibility to use the implicit data flow defined by the Tree allowing for rapid process design wherever this is possible. At the same time, we provide the possibility to use the explicit flow layout whenever the tree would require fumbling with stack rearrangement operators.
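The "no cycles" restriction on port connections can be sketched as follows. This is a minimal Python illustration under assumptions: the class `Flow`, its method names and the port names in the usage lines are hypothetical and do not describe the actual implementation in Vega.

```python
from collections import defaultdict


class Flow:
    """Operators as nodes, port connections as directed edges."""

    def __init__(self):
        self.successors = defaultdict(set)  # operator -> downstream operators
        self.connections = []               # (src_op, src_port, dst_op, dst_port)

    def _reaches(self, start, target):
        """Depth-first search: is target reachable from start?"""
        seen, todo = set(), [start]
        while todo:
            op = todo.pop()
            if op == target:
                return True
            if op not in seen:
                seen.add(op)
                todo.extend(self.successors[op])
        return False

    def connect(self, src_op, src_port, dst_op, dst_port):
        # Adding src -> dst closes a cycle exactly when src is already
        # reachable from dst (this also covers src_op == dst_op).
        if self._reaches(dst_op, src_op):
            raise ValueError(
                f"connecting {src_op} to {dst_op} would create a cycle"
            )
        self.successors[src_op].add(dst_op)
        self.connections.append((src_op, src_port, dst_op, dst_port))


flow = Flow()
flow.connect("Retrieve", "output", "Normalize", "input")
flow.connect("Normalize", "output", "SVM", "training set")
try:
    flow.connect("SVM", "model", "Retrieve", "input")
except ValueError as error:
    print(error)
```

Checking reachability at connection time keeps the graph acyclic at every step, so an execution order can always be derived from it.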

The default behaviour of RapidMiner 5.0 will be the following: Whenever an operator is added to the process, the data flow will be constructed so as to simulate the one that would have been implicit in pre-5.0 versions. Also, in case one ever gets confused, all connections can be deleted and restored in the old-fashioned way at the click of a button. This behaviour can be customized in various ways. From our experience so far, we can say that the new flow layout significantly eases process design: inserting operators at the beginning will not mess up the objects at later stages, and at every stage it is perfectly clear which object ends up where. This is particularly true in combination with the new meta data transformation, which I am going to present in one of the next posts.

For us, one of the challenges that comes with this new feature is importing old processes into RapidMiner 5.0. In these process files, the data flow is not explicitly stated, and it cannot easily be simulated since it may, for a few operators, depend on the actual input data and exist only at execution time. We therefore take the following approach: First, the operators are loaded and left unconnected. Then, a data flow is automatically constructed as described above. Here, the stack rearrangement operators (IOConsumer and IOSelector) are taken into account for the construction of the flow, but they are not connected themselves and are deleted afterwards. This imports a huge fraction of the existing processes correctly. One problem left open is the small set of operators for which the number of outputs generated depends on the input data, e.g. iterating operator chains and the ValueIterator. For these, we introduce a new collection I/O-Object type. Importing processes that use these operators may be problematic, but the results are easy to fix manually.
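The import approach described above can be sketched in Python as a symbolic replay of the old stack that records connections instead of moving data. This is an illustration only: the function `import_process`, the dictionary layout of an operator and the `which` key for IOSelector are assumptions made for this example, not the real importer or process file format.

```python
def import_process(operators):
    """Turn an implicit, stack-based operator list into explicit connections.

    operators: list of dicts in execution order with the keys "name", "type"
    and either "consumes"/"produces" (lists of object kinds) or, for
    IOSelector and IOConsumer, "kind" (plus an optional 1-based "which").
    """
    stack = []        # symbolic stack of (producer, kind); index 0 is the top
    connections = []  # (producer, kind, consumer)

    def find(kind, which=1):
        matches = [i for i, (_, k) in enumerate(stack) if k == kind]
        if len(matches) < which:
            raise LookupError(f"no matching {kind} available")
        return matches[which - 1]

    for op in operators:
        if op["type"] == "IOSelector":
            # Bring the selected object to the front; emit no connection.
            stack.insert(0, stack.pop(find(op["kind"], op.get("which", 1))))
        elif op["type"] == "IOConsumer":
            # Drop the object; the operator itself vanishes from the flow.
            stack.pop(find(op["kind"]))
        else:
            for kind in op.get("consumes", []):
                producer, _ = stack.pop(find(kind))
                connections.append((producer, kind, op["name"]))
            for kind in op.get("produces", []):
                stack.insert(0, (op["name"], kind))
    return connections


old_process = [
    {"name": "Source A", "type": "operator", "produces": ["ExampleSet"]},
    {"name": "Source B", "type": "operator", "produces": ["ExampleSet"]},
    {"name": "Select", "type": "IOSelector", "kind": "ExampleSet", "which": 2},
    {"name": "Learner", "type": "operator",
     "consumes": ["ExampleSet"], "produces": ["Model"]},
    {"name": "Drop", "type": "IOConsumer", "kind": "ExampleSet"},
]
print(import_process(old_process))
```

The sketch only works because every operator here declares a fixed list of outputs. An operator whose number of outputs depends on the input data cannot be replayed this way, which is the open problem mentioned in the text.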

In fact, the import of processes from older RapidMiner versions is one of the main reasons why we will have an alpha test phase starting in mid-August. We seek the support of the community for feedback on which of your old processes are correctly imported, and which aren't. You can register for the alpha test here:

   http://rapid-i.com/content/view/150/177/lang,en/

Any feedback is greatly appreciated. If you ever wanted to contribute to how RapidMiner is evolving, this is the time to do so.
