Imagine all you had to do to create a data mining process was to select a data set and specify what you want to do with it, e.g. predictive modelling. Wouldn't that save a lot of work?
Within the research project "e-LICO", funded by the EU within the 7th Framework Programme, the Intelligent Discovery Assistant (IDA) was developed, and it does precisely that. It comes with its own perspective (marked with the silhouette of a friendly butler) that contains all you need: the repository and the assistant itself. To use it, follow three simple steps:
Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
Select a goal. The most frequent one is probably "Predictive Modelling". All goals have comments, so you see what they can be used for.
Select "Fetch plans" and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.
The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations: e.g., it automatically replaces missing values when they exist and the learning algorithm cannot handle them, or normalizes the data when a distance-based learner is used.
You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL. Alternatively, just download it directly and place it in RapidMiner's lib\plugins folder.
Since the workflow planning happens in Prolog, this extension automatically installs a Prolog engine (XSB Prolog plus Flora 2) when it first starts. These can only be installed into a specific directory, so you must run RapidMiner as administrator the first time you use the extension. (On Windows, right-click and select "Run as administrator".)
If you try out the extension, we ask you to participate in the user survey so we can keep improving the extension. Once the extension is installed, you can easily open the survey by clicking the third button in the toolbar (the one with the letter box).
The IDA was developed as a collaboration mainly between the University of Zurich (Jörg-Uwe Kietz and Floarea Serban) and Rapid-I.
Over the years, many of you have been developing new RapidMiner Extensions dedicated to a broad set of topics. While these extensions are easy to install in RapidMiner - just download and place them in the plugins folder - the hard part is finding them in the vastness that is the Internet. Extensions made by us at Rapid-I, on the other hand, are distributed via the update server, making them searchable and installable directly inside RapidMiner.
We thought that this was a bit unfair, so we decided to open up the update server to the public, and not only that: we even gave it a new look and a new name. The Rapid-I Marketplace is available in beta mode at http://rapidupdate.de:8180/ . You can use the Web interface to browse, comment on, and rate the extensions, and you can use the update functionality in RapidMiner by going to the preferences and entering http://rapidupdate.de:8180/UpdateServer/ as the update server URL. (Once the beta test is complete, we will change the port back to 80 so we won't have any firewall problems.)
As an extension developer, just register with the Marketplace and drop me an email (fischer at rapid-i dot com) so I can give you permission to upload your own extension. Uploading is simple provided you use the standard RapidMiner Extension build process, and it will boost the visibility of your extension.
Looking forward to seeing many new extensions there soon!
We have already blogged on the RapidMiner Community Extension here and here. The community extension enables you to share your RapidMiner workflows with a large community of data miners all over the world on the community platform myExperiment.org.
This can be a great benefit: You can learn about (and from) other people's work, make your own work more visible, get new ideas, and make friends with other data miners. I just made a small video showing how it works. Here it is:
At the RCOMM, we had a challenge in which data miners had to design RapidMiner processes solving unusual tasks. The three tasks were to design a process that creates the lyrics of "99 bottles of beer", to apply a model to a data set from which a complete column had been lost, and to create a process that computes the Fibonacci numbers. All winning solutions, challenge descriptions, and necessary data preparation processes are now on myExperiment:
I think they are worth looking at since they apply quite a few clever tricks.
Furthermore, we have seen a lot of interesting and brand-new RapidMiner Extensions at the conference. One of them, made by the DFKI, assists the data miner in choosing an appropriate learner for their data set and saves you from trying a lot of different learners manually. The extension is available from our update server and is described here:
RapidMiner 5 comes with a docking framework that allows you to select and move around user interface components in order to design the interface according to your needs. Earlier versions of RapidMiner used to present process results in multiple tabs, simply displaying empty space when no results were generated yet. Since every result tab is a freely movable UI component in RM 5, there is no component which would fill up the free space when no result tabs are present - the UI would simply collapse and neighbouring components would take over the free space. This would clearly be ugly, so we started by adding an empty component serving as a placeholder, reserving space where new results would be added.
It quickly became clear that having the largest part of the result perspective filled with empty space is not particularly less ugly, so we decided to fill it up with something useful. What would be more obvious than to give a new home to the result history? What do you mean, you don't know the result history? Everyone should know the result history. Well. Admittedly, the old result history did not make it into the top ten of RapidMiner's usability charts, but it has always been a nice feature that no one used.
For RM 5, we designed a completely new result history which looks like this:
As you see, the result history presents an entry for each process execution and lists all results, each presented as a thumbnail or textual representation. Thus, you can go back in time, look at the results produced by earlier versions of your process, possibly re-open them, compare performances, and restore the particular process version if you find it performing better. Having this history readily available provides terrific assistance for rapid process design.
This way, what was originally intended to be a placeholder became one of my favourite RM 5 features.
As many of you may know by now, we are going to release a new major version of RapidMiner soon. Version 5.0, codename Vega, will come with a whole bunch of new features, some of which I am going to present in more detail in this blog during the next days and weeks.
The first feature I'm presenting here is the new flow layout which Vega will have in addition to the well-known Tree. (I'm not presenting a screenshot; an early version has been posted here.) For us, the Tree has for many years been a powerful companion, representing data mining processes in a neat and compact way. But while it lets you oversee a process the size of a cross validation on Iris at a glance, the non-expert was easily lost in the implicit data flows defined by the Tree. I would not be surprised if one or the other experienced RapidMiner trouper was unaware of how this data flow was actually defined, so let's recap for a second.
In earlier RapidMiner versions, at any stage during the execution of a process, all currently available data was kept on a stack. An operator being executed grabbed hold of the first object of the desired type (example set, model, etc.) and pushed its results back on top of the stack. Thus, if several example sets had been generated, an operator consuming an example set would always use the one generated by the most recently executed operator. Wherever this behaviour did not suit the needs of the process designer, additional operators had to be used to rearrange the stack: IOSelector (bringing an object to the front of the stack), IOConsumer (deleting useless objects), IOMultiplier (making copies of objects), and combinations of IOStorer and IORetriever (sharing objects across far-away parts of the process). In complicated processes, this could easily clutter up the process, making it hard to follow what was going on.
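The stack behaviour just described can be sketched in a few lines of Python. This is a hypothetical illustration of the pre-5.0 semantics, not RapidMiner's actual implementation; the class and type names are made up:

```python
# Sketch of the pre-5.0 implicit data flow: an operator consumes the
# topmost object of the type it needs and pushes its results back on top.

class IOStack:
    def __init__(self):
        self.objects = []  # the end of the list is the top of the stack

    def push(self, obj):
        self.objects.append(obj)

    def pop_first_of_type(self, io_type):
        # Search from the top: the most recently produced object wins.
        for i in range(len(self.objects) - 1, -1, -1):
            if self.objects[i][0] == io_type:
                return self.objects.pop(i)
        raise LookupError(f"no object of type {io_type} on the stack")

# Two operators each produce an example set...
stack = IOStack()
stack.push(("ExampleSet", "iris"))
stack.push(("ExampleSet", "sonar"))

# ...so a learner grabbing an ExampleSet gets the most recent one
# ("sonar"). To train on "iris" instead, an IOSelector would have to
# bring it to the front of the stack first.
consumed = stack.pop_first_of_type("ExampleSet")
```

This "topmost object of matching type" rule is exactly why inserting an operator early in a process could silently redirect data at later stages.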
In Vega we will have a flow layout view which makes this data flow explicit: operators will have input ports and output ports which can be connected to each other in an arbitrary fashion (as long as one doesn't construct cycles). Does this mean we have to say goodbye to the good old Tree? No. One of our design goals was to keep the option of using the implicit data flow defined by the Tree, allowing for rapid process design wherever this is possible. At the same time, the explicit flow layout is available whenever the Tree would require fumbling with stack rearrangement operators.
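The "no cycles" constraint mentioned above is the defining property of such a flow graph. A minimal sketch of the idea, with invented names (this is not RapidMiner's actual API), might look like this:

```python
# Sketch of an explicit flow graph: connections between operators are
# legal only if they do not close a cycle. Operator names are examples.

from collections import defaultdict

class FlowGraph:
    def __init__(self):
        self.edges = defaultdict(set)  # operator -> downstream operators

    def _reaches(self, start, target):
        # Depth-first search along existing connections.
        seen, todo = set(), [start]
        while todo:
            node = todo.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            todo.extend(self.edges[node])
        return False

    def connect(self, src_op, dst_op):
        # If dst_op already reaches src_op, the new edge would be a cycle.
        if self._reaches(dst_op, src_op):
            raise ValueError("connection would create a cycle")
        self.edges[src_op].add(dst_op)

flow = FlowGraph()
flow.connect("Retrieve", "Normalize")
flow.connect("Normalize", "k-NN")
# flow.connect("k-NN", "Retrieve")  # would raise: connection closes a cycle
```

Keeping the graph acyclic is what guarantees that every process has a well-defined execution order, which the Tree previously provided implicitly.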
The default behaviour of RapidMiner 5.0 will be the following: whenever an operator is added to the process, the data flow will be constructed so as to simulate the one that would have been implicit in pre-5.0 versions. Also, in case one ever gets confused, all connections can be deleted and restored in the old-fashioned way at the click of a button. This behaviour can be customized in various ways. From our experience so far, we can say that the new flow layout significantly eases process design: inserting operators at the beginning will no longer mess up the objects at later stages, and at every stage it is perfectly clear which object ends up where. This is particularly true in combination with the new meta data transformation, which I am going to present in one of the next posts.
For us, one of the challenges that comes with this new feature is to import old processes into RapidMiner 5.0. In these process files, the data flow is not explicitly stated, and it cannot easily be simulated since it may, for a few operators, depend on the actual input data and exists only at execution time. We therefore take the following approach: First, the operators are loaded and left unconnected. Then, a data flow is automatically constructed as described above. Here, the stack rearrangement operators (IOConsumer and IOSelector) are taken into account for the construction of the flow, but they are not connected themselves and are deleted afterwards. This imports a huge fraction of the existing processes correctly. One problem left open is the small set of operators for which the number of outputs generated depends on the input data, e.g. iterating operator chains and the ValueIterator. For these, we introduce a new collection I/O-Object type. Importing processes that use these operators may be problematic, but the resulting processes are easy to fix manually.
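The core of this import strategy - replaying the old stack semantics to derive explicit connections - can be sketched as follows. The operator specs here (which types each operator consumes and produces) are invented for illustration and do not reflect RapidMiner's actual operator signatures:

```python
# Sketch of auto-wiring an imported process: replay the pre-5.0 stack
# semantics and record each consumption as an explicit connection.

def auto_connect(operators):
    """operators: list of (name, consumed_types, produced_types).
    Returns explicit connections as (producer, consumer, io_type)."""
    stack = []        # entries: (producer_name, io_type)
    connections = []
    for name, consumes, produces in operators:
        for io_type in consumes:
            # Take the topmost matching object, as pre-5.0 versions did.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][1] == io_type:
                    producer, _ = stack.pop(i)
                    connections.append((producer, name, io_type))
                    break
        for io_type in produces:
            stack.append((name, io_type))
    return connections

# A toy process: read data, learn a model, read fresh data, apply the model.
process = [
    ("Retrieve",     [],                      ["ExampleSet"]),
    ("NaiveBayes",   ["ExampleSet"],          ["Model"]),
    ("Retrieve (2)", [],                      ["ExampleSet"]),
    ("ApplyModel",   ["Model", "ExampleSet"], ["ExampleSet"]),
]
connections = auto_connect(process)
```

In the real importer, IOConsumer and IOSelector would additionally influence which stack entry is consumed before being dropped from the final process; operators whose output count depends on the input data are exactly the cases this replay cannot resolve.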
In fact, the import of processes from older RapidMiner versions is one of the main reasons why we will have an alpha test phase starting in mid August. We seek the support of the community for feedback on which of your old processes are correctly imported, and which aren't. You can register for the alpha test here: