Open source software for big data analytics. No programming required.
Rapid-I Blog
 RapidMiner, Plotter 19 Oct 2010 The RapidMiner Plotters 4: Scatter 3D by Ingo Mierswa Comment (0)
Overview "Scatter 3D"
• Summary: Showing dependencies between three dimensions
• Number of Dimensions: 3
• Data Types: Numerical (dates and nominal values will be printed by their internal values)
This is the next post of a series describing all RapidMiner plotters in detail. The first post discussed the traditional Scatter plotter, the second post the Scatter Multiple plotter, and the third one covered the Scatter Matrix. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Scatter 3D plotter, we will again first have a look:

The scatter 3D plot is a simple three-dimensional plot with three axes: x, y, and z. If you plot a data set, each point will be located at the position which corresponds to the values with respect to those three axes.

The axes and the data points are plotted on the right part of the plotter, on the left you can see the plotter controls as usual. They provide all options for the different types of plotters.

The first thing you will probably notice is that there is a selection list for the z-axis, like the one known from the Scatter Multiple plotter. You can select a single dimension, as we have done in the example above, or multiple dimensions by pressing CTRL while you select additional dimensions. Alternatively, you can hold down the SHIFT key and select a second dimension, which automatically selects the complete interval in between. In the following plot, we have selected the first attribute "att1" as an additional dimension for the z-axis:

As you can see, both selected dimensions are now plotted on the z-axis against their common x- and y-axis. In order to identify the different dimensions, RapidMiner uses different colors for each z-dimension as indicated in the legend at the top.

The 3D plots are somewhat special in terms of additional options. In general, you can zoom into RapidMiner plotters by dragging a zooming rectangle, and you can pan, i.e. move around in a plot, by holding down the CTRL key while dragging the mouse. For the 3D plots in RapidMiner, however, you have three different modes: Panning, Zooming, and Rotating.

Those different modes are indicated by the icons in the lower part of the options. If you select the first mode, indicated by a lens with arrows, you will switch to the Panning mode. Now you can drag the mouse and this will move the complete plot across the screen.

The second icon indicates the Zooming mode, where you can select a zooming region by dragging a rectangle or you can simply use the mouse wheel for zooming in and out.

The third icon indicates the Rotation mode, which is the default. Dragging the mouse will now not zoom or pan the plot but rotate it. Simply try it and you will quickly get used to it.

If you want to go back to the default view, the fourth icon will help: it resets all views and axes to the initial default. And there is one last option, called Set Scales, which brings up a dialog that lets you rename the axes, apply logarithmic scales, or adapt the ranges of the axes:

Of course, the Scatter 3D plotter also provides the Export Image option, like all other RapidMiner plotters.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!


 Community 14 Oct 2010 More than 200 people like RapidMiner on Facebook by Ingo Mierswa Comment (0)

Since our start on Facebook a couple of weeks ago, more than 200 people stated on Facebook that they like RapidMiner, the company Rapid-I, or other Rapid-I products. We highly appreciate that and we promise to do our best to further improve RapidMiner for our community!

This blog is another initiative to give you the latest updates fresh from the Rapid-I labs and to have a place where we can discuss topics related to RapidMiner or data analysis in general.

And last but not least, the community meeting RCOMM 2010 was a great success! Personally, I was surprised by the many great initiatives coming from the RapidMiner community and I am happy to see all the great new things which have already been published or are about to come. Thanks everybody!

Here you can find our Facebook page and we are happy if you become a part of our community by becoming a fan!

 RapidMiner, Process, Optimization 13 Oct 2010 Finding Optimal Operators and Subprocesses - Without PaREn (aka "The Naive Way") by Ingo Mierswa Comment (0)

We recently had a discussion in the forum about the new PaREn extension for RapidMiner, in particular about whether the functionality behind the PaREn extension is also part of other data mining solutions.

The PaREn Automatic System Construction Wizard is a tool for supporting you in constructing a classification process within RapidMiner. For a given data set, it automatically recommends, constructs, and optimizes a classification process based on basic characteristics of the data set. You select a data set and the PaREn extension analyzes the data and predicts the expected accuracy for a set of widely used data mining algorithms.

One of the readers in the forum compared this to the SPSS function where a set of different models is tested on a data set and the best model is automatically chosen. But there is quite a difference: in SPSS, all models are actually tested! This can, by the way, also be done with the PaREn extension during the evaluation step, and it is also possible with a simple process, as Simon has pointed out in the discussion. And exactly this process will be shown below ;-)

The cool thing about the PaREn extension is that it predicts which model is probably the best even without any testing. This is the first time I have actually seen this meta learning approach really working and this is probably the reason why we at Rapid-I and many others love it. Kudos to Christian and the team of the DFKI for this great extension!

Ok, back to the promise that the simple approach done by SPSS is of course also possible with RapidMiner. The following process employs this more manual approach. The combination of the operator "Operator Enabler" with a grid parameter search can be used to enable / disable operators or complete subprocesses easily. For example, you could use this combination to try different model types on a given data set. In the process below, we use it to identify the difference between using normalization and skipping it before a nearest neighbors classifier:
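
For readers who think in code rather than in operators, the idea can be sketched in a few lines of Python. This is only an illustration and not the RapidMiner process itself: the toy data set, the function names (normalize, knn_predict, loo_accuracy), the choice of k = 1 and the leave-one-out evaluation are all assumptions made for the example. The "grid" consists of a single boolean parameter which enables or disables the normalization step.

```python
import math

# Hypothetical toy data: (feature vector, label). The second feature has a
# much larger scale than the first, so it dominates unnormalized distances.
DATA = [
    ((1.0, 1000.0), "a"), ((1.2, 5000.0), "a"), ((0.9, 9000.0), "a"),
    ((3.0, 1200.0), "b"), ((3.1, 5200.0), "b"), ((2.9, 8800.0), "b"),
]

def normalize(rows):
    """Rescale every feature to the range [0, 1] (min-max normalization)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans))
            for row in rows]

def knn_predict(train, query, k=1):
    """Predict the majority label among the k nearest training examples."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def loo_accuracy(data, use_normalization):
    """Leave-one-out accuracy with the normalization step enabled or disabled."""
    rows = [x for x, _ in data]
    if use_normalization:
        # Simplification: normalizing on all rows before the split
        # leaks a little information into the evaluation.
        rows = normalize(rows)
    examples = list(zip(rows, (y for _, y in data)))
    hits = 0
    for i, (x, y) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]
        hits += knn_predict(train, x) == y
    return hits / len(examples)

# The "grid": one boolean parameter that enables or disables a step.
results = {enabled: loo_accuracy(DATA, enabled) for enabled in (False, True)}
best = max(results, key=results.get)
print(results, "-> best setting for normalization:", best)
```

Replacing the boolean by a list of learner functions would give the "try different model types" variant in the same way.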

If you use different learning schemes instead, say Naive Bayes, Decision Trees or Linear Regression, you end up with exactly the same "we just try different modeling techniques" approach as the one known from SPSS. Of course the PaREn extension is much cooler, but this manual approach offers a great advantage over the extension and other solutions: you can specify all the different steps, like the evaluation scheme, as usual.

The complete process can be downloaded with our Community Extension. The name of the process is "Automatical Disabling / Enabling of Operators or Subprocesses".

Have fun!

 challenge 12 Oct 2010 2500 Euro Prize Money for Data Mining Challenge by Ingo Mierswa Comment (0)
Within the e-LICO project, Rapid-I is sponsoring a data mining challenge on Obstructive Nephropathy (ON). The task is particularly challenging since it exhibits the typical characteristics of data from the bio domain: high dimensionality versus small sample size, incomplete data, and a high degree of dependencies. The challenge is now open to all participants and the award includes prize money of 2500 Euro.

We encourage all users of RapidMiner to participate in this challenge - chances are high to come up with a good solution! The challenge ends on December 19th, 2010. Please find more information on the challenge web page at

http://tunedit.org/challenge/ON
 RapidMiner, Plotter 7 Oct 2010 The RapidMiner Plotters 3: Scatter Matrix by Ingo Mierswa Comment (4)

Overview "Scatter Matrix"

• Summary: Showing dependencies between all pairs of dimensions
• Number of Dimensions: Unlimited, 1 encoded by color, useful for up to about 50
• Data Types: Numerical, Nominal, Dates

This is the third post of a series describing all RapidMiner plotters in detail. The first post discussed the traditional Scatter plotter, the second post the Scatter Multiple plotter. Since many options and controls of the simple Scatter plotter are also relevant for the Scatter Matrix variant discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Scatter Matrix plotter, we will again first have a look:

As you can easily see, the Scatter Matrix is basically a set of Scatter plotters, one for each combination of two dimensions. In the first row, each plotter uses the first attribute for the y-axis and one of the other attributes for the x-axis. In the second row, the second attribute is used for the y-axis, and so on. Of course the combinations with the same attribute for the x-axis and the y-axis are left out, since all points would simply form a diagonal.
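
The layout described above can be written down as a small Python sketch. It only enumerates which attribute goes on which axis in each cell and is not taken from the RapidMiner sources; the function name scatter_matrix_cells and the attribute names are made up for the illustration.

```python
def scatter_matrix_cells(attributes):
    """Return (y_attribute, x_attribute) pairs row by row, skipping the diagonal."""
    return [(y, x) for y in attributes for x in attributes if x != y]

cells = scatter_matrix_cells(["att1", "att2", "att3"])
for y, x in cells:
    print(f"plot {y} (y-axis) against {x} (x-axis)")
```

For n attributes this yields n * (n - 1) single plots, which is why the matrix gets crowded quickly as the number of dimensions grows.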

Similar to the Scatter plotter, you can (and have to) select an attribute for the option Plots which is used for defining the color of each data point. Both numerical and nominal attributes are allowed for this. The possible values are shown in a legend at the top of the plot.

Just like the Scatter plot, the Scatter Matrix also offers a Jitter option. This option is quite useful if several data points are located at the same position in the two-dimensional space. Just move the jitter slider around and watch what happens: the points are moved a bit in random directions, which reveals whether, and which, points were hidden below others.
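
The principle behind jittering can be sketched in Python as follows. How RapidMiner computes the offsets internally is not described here, so the uniform random offset, the function name jitter and the parameters amount and seed are assumptions made for this example.

```python
import random

def jitter(points, amount, seed=None):
    """Shift each (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount))
            for x, y in points]

# Three data points which would be drawn exactly on top of each other.
overlapping = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
spread = jitter(overlapping, amount=0.05, seed=42)
print(spread)
```

The amount plays the role of the slider: a value of 0 leaves the points where they are, larger values pull overlapping points further apart.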

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!


 RapidMiner, Operator 5 Oct 2010 Extended Operations for Nominal Values by Ingo Mierswa Comment (0)

One of the next versions of RapidMiner (5.0.011 or the upcoming version 5.1) will provide a nice extension of the expression parser, which is used, for example, by the operator "Generate Attributes". The new operations for nominal values can be used directly within the expressions that define the new attributes.

The supported functions include:

• Number to String [str(x)],
• String to Number [parse(text)],
• Substring [cut(text, start, length)],
• Concatenation [concat(text1, text2, text3...)],
• Replace [replace(text, what, by)],
• Replace All [replaceAll(text, what, by)],
• To lower case [lower(text)],
• To upper case [upper(text)],
• First position of string in text [index(text, string)],
• Length [length(text)],
• Character at position pos in text [char(text, pos)],
• Compare [compare(text1, text2)],
• Contains string in text [contains(text, string)],
• Equals [equals(text1, text2)],
• Starts with string [starts(text, string)],
• Ends with string [ends(text, string)],
• Matches with regular expression exp [matches(text, exp)],
• Suffix of length [suffix(text, length)],
• Prefix of length [prefix(text, length)],
• Trim (remove leading and trailing whitespace) [trim(text)].
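
To give an impression of what such transformations look like, here is a Python sketch with rough analogues of a few of these functions. It is not RapidMiner expression syntax, and the exact semantics in RapidMiner (index base, return types, regular expression flavor) are not stated in this post: the 0-based positions, the whole-string matching and the example value are assumptions.

```python
import re

# Rough Python analogues of some functions from the list above.
# Assumption: positions are 0-based, as in Python.

def cut(text, start, length):
    """Substring of `length` characters beginning at position `start`."""
    return text[start:start + length]

def prefix(text, length):
    """The first `length` characters of `text`."""
    return text[:length]

def suffix(text, length):
    """The last `length` characters of `text`."""
    return text[-length:] if length > 0 else ""

def index(text, string):
    """First position of `string` in `text`, or -1 if it does not occur."""
    return text.find(string)

def matches(text, exp):
    """True if the whole `text` matches the regular expression `exp`."""
    return re.fullmatch(exp, text) is not None

# Hypothetical nominal value which is split into two new attributes.
value = "  RapidMiner-5.0 "
trimmed = value.strip()                          # trim(text)
name = cut(trimmed, 0, index(trimmed, "-"))      # everything before the dash
version = suffix(trimmed, 3)                     # the last three characters
is_version = matches(version, r"\d\.\d")
print(name.upper(), version, is_version)
```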

It is amazing how many new data transformations you can perform with this simple set of text operations. Until now, I often had to use the operator "Execute Script" for this type of operation, which is no longer necessary.

I have also just uploaded a process to myExperiment, which can be downloaded directly with our Community Extension (but of course you will need the RapidMiner update first ;-) ). The process is named "Extended Operations for Nominal Values" - just like this blog entry.

 RCOMM 4 Oct 2010 RCOMM 2010: Proceedings online! by Ingo Mierswa Comment (0)

We have compiled the proceedings of the RCOMM 2010 and they are now available as a free white paper in our online shop. Just add the conference proceedings to your cart and check out.

Here is the link to the RCOMM 2010 proceedings.

You can also find the link to the proceedings, as well as a link to pictures and RCOMM 2010 reviews, on the RCOMM 2010 web site.

 RapidMiner, Extensions 1 Oct 2010 New Extension: PaREn - The End of KXEN ;-) by Ingo Mierswa Comment (0)

We have just published a new extension called PaREn which can now be downloaded via our update server. The PaREn Automatic System Construction Wizard is a tool for supporting you in constructing a classification process within RapidMiner. For a given data set, it automatically recommends and constructs a classification process based on certain characteristics of the data set.

At first I did not believe that this actually works - but then I saw the extension in action and it was amazing. You select a data set and the PaREn extension analyzes the data and predicts the expected accuracy for a set of widely used data mining algorithms. Although the prediction is not 100% correct, the ranking most often is, and that's the important part. A single additional click directly creates the modeling process, where even the parameters are already optimized.

This is actually the first third party extension we offer through our update server. And we love it! It was written by a team of great people at the DFKI (Deutsches Forschungszentrum für künstliche Intelligenz; German Research Center for Artificial Intelligence). Thanks guys for a great addition to RapidMiner!

Thomas Ott of Neural Market Trends has made a nice video explaining the extension (an explanatory web page can be found on the DFKI web site):

I am looking forward to other great extensions which will certainly be published during the next weeks and months. A lot of promising work has already been presented at the RCOMM 2010!

 RapidMiner, Applications 1 Oct 2010 Finding Jobs with RapidMiner! by Ingo Mierswa Comment (0)

Today we want to show you a nice offer of Vault Analytics built with the help of RapidMiner:

"One way that RapidMiner’s capabilities can be applied to save hundreds of hours is by applying it to a job search. Finding good job opportunities is often a difficult process, and with more and more people being forced into the job hunt, Vault has seen a need and developed a custom RapidMiner algorithm that quickly does the searching for you and delivers hundreds of relevant jobs straight to you."

On their web site, you can directly test their offer. I didn't try it myself, but I certainly will if I ever search for a new job...

 RapidMiner, Plotter 29 Sep 2010 The RapidMiner Plotters 2: Scatter Multiple by Ingo Mierswa Comment (1)

Overview "Scatter Multiple"

• Summary: Showing dependencies between a set of dimensions against a single one
• Number of Dimensions: Unlimited, useful for up to about 30 simultaneously
• Data Types: Numerical, Nominal, Dates

This is the second post of a series describing all RapidMiner plotters in detail. The first post discussed the traditional Scatter plotter. Since many options and controls of this simple plotter are also relevant for the Scatter Multiple variant discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Scatter Multiple plotter, we will again first have a look:

As you can easily see, the multiple variant is quite similar to the usual Scatter plotter. But there are two important differences: first, you can now choose not only one but several attributes for the y-axis. All available dimensions are shown in the list on the left and you can simply select one by clicking it, or multiple dimensions by holding the CTRL key while clicking.

The result will show all selected dimensions simultaneously by using a different color for each selected attribute. In the example above, we selected the actual label (blue) together with the prediction of a modeling scheme (red). By the way, the colors are chosen automatically depending on the definition of the minimum and maximum colors in the Property Dialog of RapidMiner.

The other options on the left are quite similar to those of the usual Scatter plotter. But there is a new and different option, namely the button labeled Points and Lines...

When you press this button, a dialog appears and asks whether the data points should be represented by graphical dots, by lines, or by both:

You can perform the selection for each attribute individually or select points or lines for all dimensions at once by pressing the buttons on the bottom left. After making your selection, just press OK and watch the result:

As you can see, the actual label values (blue) are now represented by dots, while we selected a line-only mode for the predictions.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!
