Open source software for big data analytics.
No programming required.

Rapid-I Blog
Roomba, Image 28 Sep 2010
Speaking about Tracks: The Roomba Vacuum Cleaner by Ingo Mierswa

Last week I stumbled upon a long-exposure image of firefly tracks. I like the image, and I also like the idea of analyzing whether there are common patterns in their flight...

Here is a similar picture for everyone who owns a Roomba vacuum cleaner from iRobot, or who is considering buying one. This cleaner tries to cover the complete room automatically:

 

Roomba Vacuum Cleaner

 

As you can see, the Roomba cleaner hits every place in the room, or at least does so better than I probably would. Here we also have more data than in the firefly setting, which allows for a first (human) analysis: it might hit every single spot, but it seems to work quite inefficiently. However, I like the circular shape in the left part, and I wonder whether the left part of the room actually had more dirt, since the Roomba stayed there longer.

Nature, Fireflies, Analysis 24 Sep 2010
Firefly Tracks by Ingo Mierswa

A few weeks ago I stumbled upon the work of physicist Kristian Cvecek, who took those great pictures of firefly trails by using a slow shutter speed on his camera. Being a photographer myself, I was of course first captivated by the images themselves:

 

Firefly Tracks

 I wonder if anybody has already analyzed those tracks. Maybe there are some patterns like those known from other species (dance of the bees...)?

Untagged 21 Sep 2010
Prediction challenge with background knowledge on Obstructive Nephropathy by Simon Fischer
Within the e-Lico project, Rapid-I is sponsoring a data mining challenge on Obstructive Nephropathy (ON). The task is particularly challenging since it exhibits the typical characteristics of data from the bio domain: high dimensionality versus small sample size, incomplete data, a high degree of dependencies, etc. The award includes prize money of €2500.

All details can be found here: http://tunedit.org/challenge/ON

We would be glad to see many RapidMiner users participating in the challenge.

Good luck!
Simon
research, RCOMM, RapidMiner, myExperiment, Extensions, challenge 20 Sep 2010
RCOMM Challenge Processes and Extensions by Simon Fischer

At the RCOMM, we had a challenge in which data miners had to design RapidMiner processes solving unusual tasks. The three tasks were to design a process that creates the lyrics of "99 bottles of beer", to apply a model to a data set from which a complete column was lost, and to create a process that computes the Fibonacci numbers. All winning solutions, challenge descriptions, and necessary data preparation processes are now on myExperiment:

http://www.myexperiment.org/search?query=rcomm&type=all

I think they are worth looking at since they apply quite some clever tricks.
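
For readers who do not know two of these tasks, here is a rough sketch of what they ask for, written in plain Python rather than as a RapidMiner process. It is not one of the winning solutions. The function names are made up for this illustration, the Fibonacci sequence is assumed to start at 0, and the lyrics follow one common wording of the song.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers (assumed to start 0, 1, 1, 2, ...)."""
    numbers = []
    a, b = 0, 1
    for _ in range(n):
        numbers.append(a)
        a, b = b, a + b  # shift the pair one step along the sequence
    return numbers


def bottles_verse(count):
    """Return one verse of "99 bottles of beer" for the given bottle count."""
    def bottles(k):
        if k == 0:
            return "no more bottles"
        return "1 bottle" if k == 1 else f"{k} bottles"

    first = f"{bottles(count)} of beer on the wall, {bottles(count)} of beer."
    if count == 0:
        second = "Go to the store and buy some more, 99 bottles of beer on the wall."
    else:
        second = f"Take one down and pass it around, {bottles(count - 1)} of beer on the wall."
    # Capitalize only the first character so "no more" becomes "No more".
    return first[0].upper() + first[1:] + "\n" + second


# The full lyrics are simply the verses for 99 down to 0.
lyrics = "\n\n".join(bottles_verse(n) for n in range(99, -1, -1))

print(fibonacci(10))
print(bottles_verse(2))
```

The hard part of the challenge was of course not the logic itself but expressing loops and state like this with RapidMiner operators.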

Furthermore, we have seen a lot of interesting and brand-new RapidMiner Extensions at the conference. One of them, made by the DFKI, assists data miners in choosing an appropriate learner for their data set and saves them from trying a lot of different learners manually. The extension is available from our update server and is described here:

 http://madm.dfki.de/rapidminer/wizard

Try it out!

Plotter 20 Sep 2010
The Future of Statistics: Moving Pictures by Ingo Mierswa

We at Rapid-I definitely believe that appropriate visualizations of data are often much more important than the best models describing those data. For exactly that reason we implemented about 30 high-dimensional plotters, which will be discussed in the blog during the next months.

One strong concept which will become more and more important as data analysis is transferred to computers is animation. Everybody knows this from looking at a 3D plot on a 2D screen: you cannot identify the exact location of the data points without moving them around a little, for example by slowly rotating the plot. Every user of the 3D plotter of RapidMiner probably knows this phenomenon.

But there is a different and probably more intuitive use of animation, namely to express changes over time. You can try this yourself with the great Gapminder website:

 

http://www.gapminder.org/

 

You can select from an amazing amount of socio-economic data about almost all countries of the world and define nice-looking bubble charts as well as animations which show the changes over time.

Check out, for example, the CO2 emissions since 1820 (example above) or the Wealth & Health of Nations. Click the "Play" button on the lower left to start the animation.

A strong plot should be clean and easy to read, and in the best case people can understand what is displayed just by viewing it, without any additional information. The Gapminder people implemented this almost perfectly: the user interface is simple and intuitive yet powerful, and the meaning of every axis as well as of every visual concept is explained directly within the plot. Thanks for this great - and interesting - service!

Untagged 16 Sep 2010
RCOMM 2010 - Day 2 by Ingo Mierswa

The second day of RCOMM 2010 was just as exciting as the first one. We started with a great talk by Thomas Ott of http://www.neuralmarkettrends.com about Forecasting Historical Volatility for Option Trading. Thomas is an awesome speaker, and it was great listening to him and his experiences with modeling the markets:

 RCOMM 2010

 

By the way, Thomas has written a nice wrap-up of his RCOMM 2010 experience so far at http://www.neuralmarkettrends.com/2010/09/14/rcomm-2010-having-a-blast/. Here is another picture of him answering a question from the audience:

 

RCOMM 2010

 

The second talk in the first session was given by Marin Matijas about a fascinating application domain for RapidMiner, namely the forecasting of load in the energy sector. It was great to see what Marin has already achieved by predicting the necessary amount of energy much better than what was achieved before.

The next three talks dealt with aspects of the RapidMiner architecture and how data analysis with RapidMiner can be improved in terms of memory efficiency and/or runtime. Alexander Arimond showed a solution for distributed data mining based on the Map & Reduce paradigm (for example Hadoop), which achieved a tremendous speed-up of up to a factor of 6 on eight machines.
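
To give an idea of what the Map & Reduce paradigm means, here is a tiny single-machine Python sketch. It is not the solution from the talk and does not use Hadoop. All function names are made up for this illustration, and the partitions are plain lists standing in for data held on different machines. The last function puts the reported speed-up into perspective: a factor of 6 on eight machines corresponds to a parallel efficiency of 75%.

```python
from functools import reduce


def map_partition(rows):
    """Map step: summarize one data partition as (sum, count)."""
    return (sum(rows), len(rows))


def combine(left, right):
    """Reduce step: merge two partial summaries into one."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(partitions):
    """Compute a mean from per-partition summaries instead of from the raw rows."""
    # In a real cluster, each map_partition call would run on a different machine.
    partials = [map_partition(rows) for rows in partitions]
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count  # assumes at least one row overall


def parallel_efficiency(speedup, machines):
    """Fraction of the ideal linear speed-up that was actually reached."""
    return speedup / machines


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(distributed_mean(partitions))   # 3.5
print(parallel_efficiency(6, 8))      # 0.75
```

The point of the pattern is that only the small summaries travel between machines, never the raw data.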

Marco Stolpe showed how a hierarchical variant of frequent item sets, namely hierarchical heavy hitters, can be implemented in RapidMiner. This should become the starting point for the discussion about how stream mining can be integrated into RapidMiner in general. We will come back to this during the next weeks, and I am looking forward to finding a solution in collaboration with Marco.

The last architecture talk was given by Olaf Laber from our partner Ingres. He showed how scalable high-speed data mining can be achieved by combining RapidMiner with Ingres VectorWise:

 

RCOMM 2010

Imagine learning a decision tree on 10 million records in only a couple of seconds while using less than 1 gigabyte of memory. We experienced a speed-up of up to a factor of 40 for Naive Bayes. Welcome to Ingres VectorWise + RapidAnalytics!

In the workshop session, our head of research & development, Simon Fischer, gave a live demo of RapidAnalytics and showed how easily data and processes can be shared or integrated by means of web services. We got a lot of positive feedback on RapidAnalytics, and we will release the Community Edition to the general public soon (please contact us if you want to become a pilot customer):

 

RCOMM 2010

Sebastian Land then showed the new R extension for RapidMiner. We got a lot of positive feedback on the extension as well, but also suggestions for improvement which we will certainly consider for the second version.

The last session dealt with information and relation extraction. Timur Fayruzov started with a great talk about the extraction of protein interactions. The results were quite impressive and also included a nice web interface with RapidMiner running in the background as the engine.

The last talk was then given by Felix Jungermann. He showed his Information Extraction Plugin, which allows for the generic extraction of information from documents, for example Named Entity Recognition. The extension comes with an awesome graphical user interface as well as many new algorithms, and I am really looking forward to the release of his extension.

The first RapidMiner Community Meeting and Conference was a complete success. The quality of the talks was far above the average and I met so many lovely people. We had a lot of great discussions and new plans and projects were born as well. Thanks to everyone who participated and I am looking forward to meeting all of you next year again.

RCOMM 14 Sep 2010
RCOMM 2010 - Day 1 by Ingo Mierswa

For the first paragraph I'll make it short: I love it!

After the introductory training sessions yesterday, we started with the actual conference today and immediately had a lot of really great talks. Katharina Morik began with her invited talk about data mining under constrained resources. This was of course not the first time I heard a talk given by her, but as always it was an inspiring experience just listening to Katharina and her visions of what can be expected for data mining, keeping current and upcoming applications in mind.

In the next session, Kyle Goslin from the ITB in Dublin presented a cool tutorial wizard tool which can be used to easily design new RapidMiner tutorials, which can for example be used for lectures. This is probably a great extension that will help teachers a lot in using RapidMiner in their data mining courses - so please stay tuned, since we will soon add the Extension or integrate it directly into RapidMiner.

The next talk was given by Christian Kofler from the DFKI in Kaiserslautern and covered a nice integration of landmarking features for meta learning, which were presented by Sarah Abdelmessih later in the afternoon. At first I couldn't believe it, but I have seen the first fully integrated (and working!) system for meta learning in my life. The PaREN extension efficiently calculates a set of only four landmarking features and predicts the accuracy of seven learning schemes based on those meta features. Even if the accuracy predictions are not 100% correct by themselves, this is incredibly helpful since the ranking was almost perfect. If the accuracy predictions indicate that Naive Bayes is the best way to go, it very likely is the best scheme for the data at hand. Don't try different model types yourself: just ask the PaREN extension in the future. Cool stuff.
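
Why a good ranking matters more than exact accuracy values can be shown with a few lines of Python. This is only an illustration of the idea, not how the PaREN extension works. The learner names other than Naive Bayes and all numbers are invented for this example and are not results from the talk.

```python
def rank_learners(accuracies):
    """Return the learner names sorted from highest to lowest accuracy."""
    return sorted(accuracies, key=accuracies.get, reverse=True)


# Invented numbers for illustration only; NOT results from the PaREN extension.
predicted = {"Naive Bayes": 0.90, "Decision Tree": 0.84, "k-NN": 0.79}
measured = {"Naive Bayes": 0.86, "Decision Tree": 0.81, "k-NN": 0.77}

# Every single prediction is off by a few points ...
errors = {name: round(predicted[name] - measured[name], 2) for name in predicted}
print(errors)

# ... but both rankings agree, so the recommended learner is the same.
print(rank_learners(predicted))
print(rank_learners(measured))
```

As long as the order is right, the user picks the same learner they would have found by trying all of them.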

Then Floarea Serban from our e-LICO partner (University of Zurich) presented the workflow planner for intelligent discovery assistance. The goal is quite similar to that of the PaREN Extension, but it concentrates on the whole data mining process instead of the selection of the optimal learning scheme alone. Again, this was a fantastic demonstration of user support: you simply define the data set and the goal you want to achieve ("discretize all features") and it generates all processes which will solve this task. Great for beginners but also for RapidMiner experts who want to quickly solve routine tasks.

I unfortunately missed the talk by Zoltan Prekopcsak about Cross-Validation: the illusion of reliable performance estimation so I cannot say much about it, sorry. But I heard that it was also a really interesting talk which inspired the participants to a lot of discussion afterwards - and this is almost always a good sign.

Milan Vukiecevic gave a talk about WhiBo, which is like having a mini-RapidMiner within RapidMiner. They divided well known algorithms like decision trees into their components which can now be almost arbitrarily combined. This allows for the easy development of already known algorithms (like the many different decision tree or k-means variants) but also simplifies the detection of new ones. I would love to see a genetic programming approach combining these components automatically for a given data mining task in order to construct the optimal modeling scheme.

Tobias Malbrecht then showed some small processes for creating reports within RapidMiner. There were unfortunately some technical issues with the browser, file locking, and the wireless network, but I think the participants were able to see the process-based reporting style together with a new Portal report generator which will be part of the next release of the Reporting Extension.

The final session ended with a game show, "Who wants to be a data miner?", which was hosted by Simon Fischer and Sebastian Land. They challenged the contestants with three well-designed tasks: creating the text of "99 bottles of beer" as an example set, imputing the values for a missing column, and calculating the Fibonacci numbers with a RapidMiner process. Two well-experienced RapidMiner consultants were not able to solve the third challenge in time - but Matko did great, although he had been using RapidMiner for only a couple of months. Congrats to you, Matko!

Ok, I was really bad at the game show myself, but you can't imagine how hard it is to design a recursive process with 50 people watching you while Simon makes funny comments and game show music is playing in the background. However, the great discussions during our dinner tonight helped me to overcome the shame ;-)

I am looking forward to the second day and I expect many more fantastic talks for tomorrow. See you there!

 

RapidMiner, Plotter 13 Sep 2010
The RapidMiner Plotters 1: Scatter by Ingo Mierswa

Overview "Scatter"

  • Summary: Showing dependencies between two (three) dimensions
  • Number of Dimensions: 2 plus 1 encoded by color
  • Data Types: Numerical, Nominal, Dates

The plotting facility of RapidMiner is certainly one of its strongest parts. In total, several dozen plotters for data, models, and weights are provided and allow the interactive inspection of your data analysis results. In many data mining projects, an explorative approach is the first step towards the understanding of the data and the problem at hand. The visualization of high-dimensional data sets and models hence is an important part and for exactly this reason we decided to incorporate those powerful plotters into RapidMiner.

We got a lot of requests from RapidMiner users who wonder what exactly the options of the different plotters mean. Although some are pretty simple and self-explanatory, others are harder to understand. Today we will start a series of blog entries which will describe all available plotters with their options in order to allow for deeper insights into your data and models. From time to time we will add a new blog entry until all plotters are covered here.

The first plotter is one of the simplest: the Scatter plot. It might look simple at first glance, but since several of its options are also part of the other plotters, I consider them very important. First, let's have a look at the plotter:

 

 

The scatter plot is a simple two-dimensional plot with two axes: x and y. The x-axis is plotted horizontally and the y-axis vertically. If you plot a data set, each point will be located at the position which corresponds to the values with respect to those two axes.

The axes and the data points are plotted in the right part of the plotter. On the left you can see the plotter controls, which are located there for all plotters and provide all options for the different types of plotters.

The first option is called Plotter. You can select the type of the plotter here. In this example, the type is Scatter - we will cover the other plotter types in future posts. Directly below the Plotter option you will find two boxes where you can select the attribute (variable, dimension) of your data set or model which should be used for the x-axis and for the y-axis. Both options have to be set; otherwise the plotter will not show anything. By the way, you can use numerical attributes as well as nominal attributes for the axes. Even date attributes are supported!

As you can see, you can also specify whether the selected attribute should be transformed to a log scale. Just check the box below the corresponding axis.

The next option is called Color Column. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the data points. It does not matter if the selected column is numerical or nominal, both scenarios will work. Below you will find an image showing the well-known Iris data set where the class (the label) was chosen as color column:

 

 

You can see that an additional legend is now shown at the top of the plot, indicating the meaning of the colors used. In the case of a numerical color column, the legend shows the colors together with the minimum and maximum values.
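
For those who like to see such things in code, the two legend styles can be sketched in a few lines of Python. This is only an illustration of the concept and not RapidMiner's actual implementation. The function names and the palette are assumptions made for this example.

```python
def nominal_colors(values, palette=("red", "green", "blue", "orange", "purple")):
    """Assign one palette color per distinct nominal value, in order of first appearance."""
    legend = {}
    for value in values:
        if value not in legend:
            # Reuse palette colors cyclically if there are more values than colors.
            legend[value] = palette[len(legend) % len(palette)]
    return [legend[value] for value in values], legend


def numerical_colors(values):
    """Map numbers to a position between 0 and 1 on a color gradient.

    The legend of a numerical color column only needs the minimum and maximum.
    """
    lowest, highest = min(values), max(values)
    span = highest - lowest
    positions = [(v - lowest) / span if span else 0.0 for v in values]
    return positions, (lowest, highest)


labels = ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"]
colors, legend = nominal_colors(labels)
print(colors)   # ['red', 'green', 'red', 'blue']
print(legend)

positions, value_range = numerical_colors([1.0, 3.0, 5.0])
print(positions, value_range)   # [0.0, 0.5, 1.0] (1.0, 5.0)
```

A nominal column needs one legend entry per value, while a numerical column gets by with the two ends of the gradient.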

You might have noticed the Jitter option. This option is quite useful if several data points are located at the same position in the two-dimensional space. Just move the jitter slider around and see what happens: the points move a bit in a random direction, revealing whether points are hidden underneath and which ones.
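
The idea behind jittering is simple enough to sketch in Python. This is a minimal illustration under my own assumptions, not the code used in RapidMiner: the function name, the uniform noise, and the meaning of the `amount` parameter (which stands in for the slider position) are all made up for this example.

```python
import random


def jitter(points, amount, seed=None):
    """Shift every (x, y) point by a small random offset of at most `amount` per axis."""
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three data points at exactly the same position look like a single point ...
overlapping = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
print(len(set(overlapping)))

# ... until a little jitter pulls them apart.
jittered = jitter(overlapping, amount=0.1, seed=42)
print(len(set(jittered)))
```

Note that jitter only changes where the points are drawn. The underlying data values stay the same.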

The last two options are pretty simple: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

 

Zooming and Panning

In general, you can zoom into a RapidMiner plotter by dragging a zooming rectangle indicating which part of the plot should be drawn at a larger scale. The zooming rectangle is shown as a bluish rectangle like the one in the following picture:

 

Zooming and Panning

 

In order to zoom in, drag the rectangle from the top left to the bottom right. In order to zoom out, simply drag a rectangle in the opposite direction, i.e. from the bottom right to somewhere in the upper left. Just try it; you will quickly get used to it.

If you have zoomed in, you probably want to move around in order to look at other parts of the plot. This movement within a plot is called panning and can be done by holding down the CTRL key while dragging the mouse. The plot will then be moved in the direction in which you dragged the mouse.
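
The zooming and panning behavior can be summarized in a short Python sketch. Again, this is only an illustration and not RapidMiner's implementation. The function names are made up, all coordinates are assumed to be data coordinates with the y-axis pointing upwards, and the zoom-out factor of 2 is an assumption, since how far the plotter zooms out is not stated here.

```python
def zoom(view, start, end, zoom_out_factor=2.0):
    """Return the new visible range after dragging a rectangle from start to end.

    view is (x_min, x_max, y_min, y_max); start and end are (x, y) drag points.
    """
    x_min, x_max, y_min, y_max = view
    (x0, y0), (x1, y1) = start, end
    if x1 > x0 and y1 < y0:
        # Dragged from top left to bottom right: show exactly the rectangle.
        return (x0, x1, y1, y0)
    # Any other direction: zoom out around the center of the current view.
    center_x, center_y = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_width = (x_max - x_min) / 2 * zoom_out_factor
    half_height = (y_max - y_min) / 2 * zoom_out_factor
    return (center_x - half_width, center_x + half_width,
            center_y - half_height, center_y + half_height)


def pan(view, dx, dy):
    """Move the plot content by (dx, dy); the visible range shifts the opposite way."""
    x_min, x_max, y_min, y_max = view
    return (x_min - dx, x_max - dx, y_min - dy, y_max - dy)


view = (0, 10, 0, 10)
print(zoom(view, start=(2, 8), end=(6, 4)))   # (2, 6, 4, 8)
print(zoom(view, start=(6, 4), end=(2, 8)))   # (-5.0, 15.0, -5.0, 15.0)
print(pan(view, dx=1, dy=2))                  # (-1, 9, -2, 8)
```

The pan function shows why the plot follows the mouse: dragging to the right moves the points to the right, which means the visible x-range moves to the left.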

 

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

myExperiment, Extension, Community 9 Sep 2010
50 Processes on myExperiment by Ingo Mierswa

Good news for the users of the RapidMiner Community Extension. Up to now, 50 RapidMiner processes were uploaded to the myExperiment portal and can directly be browsed and downloaded into RapidMiner.

MyExperiment is a community website where people share workflows of various kinds. It is an active community, and the portal comes with all the nice social network features:

 

  myExperiment Portal

 

In one of our previous blog posts, we described the Community Extension for RapidMiner in detail. The Community Extension directly connects to myExperiment, which means that you can easily upload the process you are currently working on with a single click. The extension also allows you to browse RapidMiner processes on myExperiment and download them to your local machine directly from within RapidMiner.

I really like the idea of a data-mining-process-wiki which can serve as a common knowledge source for data analysts worldwide. And I am happy that so many people already wanted to share this knowledge with others, for example this nice process which can be used to replace missing values with other attributes' values:

 

RapidMiner Process on myExperiment

 

More information about how to use the Community Extension can be found at http://www.e-lico.eu/?q=node/226

So you should download the extension from our update- and installation server in the Help menu of RapidMiner, activate the myExperiment view in the View menu and start to up- and download processes. Happy sharing!

 

YouTube, Twitter, Facebook 8 Sep 2010
Rapid-I on Facebook, Twitter, YouTube by Ingo Mierswa
Just a short note: you can find Rapid-I together with the latest information about RapidMiner, RapidAnalytics, and our other products as well as about data analysis in general on Facebook, Twitter, and YouTube. See you there!