Open source software for big data analytics. No programming required.
Rapid-I Blog
 Fun 29 Dec 2010 Christmas Tree 2.0 by Ingo Mierswa Comment (0)

Maybe next year:

The video shows some "Christmas trees" built from ferrofluids in dynamic magnetic fields:

http://en.wikipedia.org/wiki/Ferrofluid

Fascinating stuff.

 Plotter 22 Dec 2010 The RapidMiner Plotters 11: Survey by Ingo Mierswa Comment (0)

Overview "Survey"

• Summary: Compact visualization of high-dimensional data sets; often shows correlations quite well
• Number of Dimensions: unlimited in theory, but useful up to several hundred depending on the screen resolution
• Data Types: Numerical, Nominal, Dates

This is the next post in a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the Survey plotter, we will first have a look:

The basic idea behind a survey plot is not widely known, but it is actually pretty simple: if you have n dimensions, the plot consists of n rectangular areas, and each area represents one of the dimensions. Similar to a parallel plot, the areas are placed side by side, but in a survey plot the axis of each area runs horizontally instead of vertically.

Each data point is then printed like in a bar chart (covered later in this series but probably widely known). A line is used to represent the data for each dimension, with its length proportional to the dimensional value it represents. The complete data point is then represented by all line segments in the same row.

In the example above, you can see about 200 examples, each drawn as one row of short horizontal line segments. Those line segments represent the values of the 200 examples for each of the roughly 60 attributes (dimensions). This visualization is very compact and works well even for high numbers of dimensions and thousands of examples. One of the first things you can easily see is the distribution of values in the different dimensions. For example, the center part has many more high values than the attributes on the left or on the right. This is especially useful if the attributes are ordered, as is often the case for series data.

The second advantage of the survey plot is that it quickly gives some insight into the correlation between any two dimensions. This is even easier if you sort the data by one or several of the dimensions. And this is where the options on the left become important.

Especially for classification data sets, the last option named Color is probably the most important. Here you can select one of the dimensions and define a color scheme based on the selected attribute’s values. The result might look like this:

If you look closely, you can probably already see that for some of the attributes (attribute_10, attribute_11, and attribute_12, for example) the distribution differs a lot between the two classes. For example, the values of attribute_12 are in general higher for one of the classes. You can make this effect even stronger by sorting the data. The survey plot in RapidMiner offers sorting by up to three dimensions, which you select with the parameter boxes named First column, Second column, and Third column. The following picture shows the same data set sorted first by attribute_12 and then by attribute_45, i.e. in cases where the values of attribute_12 are equal, the second attribute determines the order:
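
For readers who want to reproduce this kind of two-level sort outside the plotter, here is a minimal R sketch. It assumes a small invented data frame called examples that merely reuses the attribute names from above; it illustrates the ordering rule only and is not how the plotter itself is implemented.

```r
# Invented example data; only the column names are taken from the text above.
examples <- data.frame(
  attribute_12 = c(0.8, 0.2, 0.8, 0.2),
  attribute_45 = c(0.3, 0.9, 0.1, 0.4),
  label        = c("A", "B", "A", "B")
)

# order() sorts by the first key; the second key is only used to break ties,
# which corresponds to "First column" and "Second column" in the plotter.
sorted <- examples[order(examples$attribute_12, examples$attribute_45), ]

print(sorted)
```

Rows with equal attribute_12 values end up next to each other, and within such a group attribute_45 decides the order.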

Now you can see several things: there are other attributes highly correlated to attribute_12, which is now placed as the first dimension in the plot. You can identify those correlated attributes easily since they are also (almost) sorted now. Check for example attribute_10 or attribute_11. By the way, the name of an attribute is shown simply by clicking on one of the graphical columns.
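
The visual impression "this column is almost sorted as well" corresponds roughly to a high rank correlation with the sort column. The following R sketch shows that idea on invented values; the numbers are made up for illustration and are not taken from the data set in the pictures.

```r
# Invented values; attribute_10 was chosen to rise and fall with attribute_12.
attribute_12 <- c(0.10, 0.40, 0.20, 0.90, 0.60)
attribute_10 <- c(0.15, 0.35, 0.25, 0.80, 0.70)
attribute_45 <- c(0.30, 0.10, 0.90, 0.50, 0.40)

# Spearman's rank correlation compares the orderings of two columns,
# which is what a sorted survey plot lets you judge by eye.
rho_10 <- cor(attribute_12, attribute_10, method = "spearman")
rho_45 <- cor(attribute_12, attribute_45, method = "spearman")

print(c(attribute_10 = rho_10, attribute_45 = rho_45))
```

A value close to 1 means the column would look sorted too once the data is sorted by attribute_12; a value near 0 means it would still look scattered.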

When color is used for different classes, as is the case here, sorting can sometimes reveal which dimensions are best at separating the classes. For example, the combination of attribute_12 and attribute_45 already produces colored bands of the same class, i.e. a classification scheme should be able to define those regions and identify those with a high probability for one of the classes.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of dozens of supported image formats.

Other parts of the plotter series:

 RapidMiner, KDnuggets 21 Dec 2010 RapidMiner among the Top Searches at KDnuggets by Ingo Mierswa Comment (0)

Many readers probably know the data mining portal KDnuggets:

If you don't know it, you should definitely give it a try. Although the site design might look a bit cramped at first sight, you will realize that this is simply due to the fact that you can find literally everything about data mining on KDnuggets.

In my opinion, KDnuggets actually is the top resource for data analysts worldwide, and you can find a lot of information there about data mining and analytics software, job offers, analysis consulting, training courses, and much more. KDnuggets is also well known for its yearly poll on the data analysis software people use, in which RapidMiner, together with R, has been really successful over the last couple of years.

Today, I found another interesting piece of news on KDnuggets about the statistics of the site search. The search terms “rapidminer” / “rapid miner” are now among the top search terms on KDnuggets and - again together with R - are also among those terms which grew most in 2010.

This is good news and we are really happy that so many people around the world appreciate the work we have put into RapidMiner and the other data analysis solutions of Rapid-I. Thanks for that!

 RapidMiner, RapidAnalytics, Business Analytics 20 Dec 2010 RapidAnalytics 1.0 and RapidMiner 5.1 released! by Ingo Mierswa Comment (0)

We have just released our new open source business analytics server RapidAnalytics as well as a new version of RapidMiner:

We have discussed RapidAnalytics and its benefits here in the blog before, so I just want to give you the link again to a short intro video showing some of its features:

 R, Example 15 Dec 2010 Simple Example for R in RapidMiner by Ingo Mierswa Comment (0)

We got a lot of positive feedback after the release of the R extension, which allows the integration of R scripts directly into the analysis processes of RapidMiner. Many people really like this approach, and for exactly that reason I would like to ease the first steps for those of you who are less experienced in programming in general and in programming with R.

The following example performs probably one of the simplest data transformations you can think of: we want to use R to add two columns of a data set and store the results in a new column called “Sum”.

Of course it is even simpler to use a special operator for this task, namely the operator “Generate Attributes”. However, the process below should be simple enough to demonstrate some of the necessary R concepts for less experienced users. In a programming lesson, the example below would probably be called the “Hello World” example for R in RapidMiner.

Of course you will need a correctly installed R extension in order to be able to follow this short tutorial. Please refer to our forum if you have any problems during the installation. Ok, let’s start. We assume we have a data set with four columns named a1 to a4 and another special attribute, the label. We take this input from our RapidMiner repository which is the first step in the process below:

After loading the data with “Retrieve” we simply add a new operator “Execute Script (R)” and connect the output port of Retrieve delivering the data set during execution with the input port of the new operator. We now define the inputs of the script by clicking on the parameter button “inputs” which will open the following dialog:

We define the first input (we only have one) by giving it the name “data”. You can reference the delivered data set then in the script by using this name.
The second definition is the R script itself. Click on the parameter button “script” in order to open a dialog where you can enter an arbitrary R script. This dialog looks like the following one:

Here is what the script does:

Line 1: sum_column <- data[1] + data[2]
This line generates a new column named “sum_column” by adding the first column of data – indicated by the 1 in brackets – to the second one. Please note that we have used the defined name “data” here.

Line 2: complete_data <- c(data, sum_column)
We now concatenate (command: c) the newly generated column “sum_column” with the given data set named “data” and store it under the name “complete_data”.

Line 3: result <- as.data.frame(complete_data)
We now transform the result into a data frame. Data frames are the R concept for data tables or matrices: they consist of named columns, which may have different types. They are pretty similar to the Example Sets known from RapidMiner. Please note that you have to transform your results to data frames with the command “as.data.frame” if you want to deliver the results back to RapidMiner as an Example Set (see below).

Line 4: colnames(result)[6] = "Sum"
This last step is optional and simply renames the new column to “Sum”. Of course this could also be done afterward with the operator “Rename”.
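
If you want to try the four lines in a plain R session first, the following sketch may help. The four script lines are the ones explained above; the data frame at the top is only an invented stand-in (an assumption for illustration) for the Example Set that RapidMiner would hand over under the name “data”.

```r
# Hypothetical stand-in for the input that RapidMiner passes in as "data":
# four numeric columns a1..a4 plus the label. All values are invented.
data <- data.frame(
  a1    = c(1, 2, 3),
  a2    = c(10, 20, 30),
  a3    = c(4, 5, 6),
  a4    = c(7, 8, 9),
  label = c("yes", "no", "yes")
)

# The four script lines from the post, unchanged.
sum_column <- data[1] + data[2]          # element-wise sum of columns 1 and 2
complete_data <- c(data, sum_column)     # c() on data frames gives a plain list of columns
result <- as.data.frame(complete_data)   # turn that list back into a data frame
colnames(result)[6] = "Sum"              # five input columns, so the new one is number 6

print(result)
```

Note that the index 6 in the last line depends on the input having exactly five columns; with a different data set it would have to be adjusted.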

The final step is to define the results and how they are delivered back to RapidMiner. Simply click on the parameter button named “results” and the following dialog will be shown:

Here you can define which variables used in the script should be delivered. In our case it should only be the variable “result”, which contains the resulting data set. If a variable is a data frame (see above), it can be converted directly into a RapidMiner Data Table / Example Set. Otherwise, you can only deliver a generic R result.

There you go. Now you can simply run the process and add two columns with R directly within a RapidMiner process. Have fun trying out other data transformations!
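
As a possible starting point for such experiments, here is a hedged sketch of the same idea written with column names instead of column positions, plus one further transformation. The input data frame and the new column name “Mean” are invented for illustration; only “data”, “result”, a1 to a4, and “Sum” come from the example above.

```r
# Invented stand-in for the RapidMiner input named "data".
data <- data.frame(
  a1    = c(1, 2, 3),
  a2    = c(10, 20, 30),
  a3    = c(4, 5, 6),
  a4    = c(7, 8, 9),
  label = c("yes", "no", "yes")
)

# Start from a copy so the input itself stays untouched.
result <- data

# Same sum as before, but addressed by column name.
result$Sum <- data$a1 + data$a2

# Another transformation: the row-wise mean of the four numeric columns.
result$Mean <- rowMeans(data[c("a1", "a2", "a3", "a4")])

print(result)
```

Addressing columns by name avoids having to count column positions, and “result” is already a data frame, so no extra conversion is needed before handing it back.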

I have also uploaded the process to myExperiment with our Community Extension. You can simply download it from there and directly try the scripting operator. The uploaded process also contains a parallel way to do this calculation by using the native operator “Generate Attributes” instead.

 Plotter 13 Dec 2010 The RapidMiner Plotters 10: Series Multiple by Ingo Mierswa Comment (0)

Overview "Series Multiple"

• Summary: Useful plotter for multiple series in different value ranges
• Number of Dimensions: unlimited in theory, but useful up to a dozen
• Data Types: Numerical, Dates

This is the next post in a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Series Multiple plotter, we will first have a look:

The series multiple plot is very similar to the series plotter discussed in the last session. Like the non-multiple variant, it is the basic representation for series values, i.e. for cases where the values are ordered by a known or unknown dimension. It is most often used for time series plots, where the x-axis represents time and the change of values in the series is plotted according to the value range depicted by the y-axis.

The series values have to be encoded in the attributes / columns of the data table. This means that each series is stored by a single column in the table. In the example above, we have selected three series in the Plot Series selection list on the left. The trajectory of those series is now plotted and each series gets its own color. Those colors are explained in a legend at the left of the plot together with the different value ranges. The fact that different ranges are used for the selected dimensions is the major difference to the traditional series plot.

Just as for the traditional series plotter, the multiple variant also offers a setting called index dimension. With this setting, you can select one of the dimensions which should be used for defining the range on the x-axis. If your data set contains, for example, a date column specifying on which date and / or time the measurements of the other series (the other columns) were taken, you probably want to define the date as index dimension. You can see an example of using an index dimension in our last session about the traditional series plotter.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of dozens of supported image formats.

Other parts of the plotter series:

 OSBI, Business Intelligence, Business Analytics 12 Dec 2010 OSBI 2010 Review by Ingo Mierswa Comment (0)

Here is just a short review of OSBI 2010, which took place last week in Neustadt, Germany. The event was held in the beautiful Hambacher Schloss, a historically very important place: 30,000 people gathered here in 1832 to define principles of freedom, which ultimately led to the birth of democracy in Germany.

I opened the day with a short introduction (see the picture below) and introduced Konstantin Böhm of Ancud IT, who gave a really great talk comparing the revolution of 1832 in Hambach with the IT revolution we are experiencing right now. This was a good start for this year's OSBI - thanks Konstantin!

After Konstantin's talk, the parallel sessions started. We had a couple of great talks in the big plenum and in parallel there were several workshops in which the current state of open source business intelligence solutions was demonstrated.

In one and a half of these workshops, we demonstrated our new solution RapidAnalytics, which is the first open source business analytics solution ever and comes with a complete server setup, web-based access, and a report designer. We had a lot of discussions and questions after the demonstration of RapidAnalytics - most of the questions were about the integration of data mining and predictive analytics into business processes and into reports, both of which are child's play for RapidAnalytics.

We are glad that many of the participants already liked RapidAnalytics so much. We will release the Community Edition of RapidAnalytics next week so that you can soon check out some of its features yourself. Of course we are also happy to give interested customers an in-depth demonstration of the abilities of the Enterprise Edition of RapidAnalytics - just contact us if you are interested.

Below you can see the Rapid-I crew which visited the OSBI 2010. From left: Sebastian Loh, Dr. Simon Fischer, Nadja Mierswa, myself (Dr. Ingo Mierswa), Tobias Malbrecht, Helge Homburg, Christian Brabandt, and Ralf Klinkenberg.

We ended the day with a talk by Olaf Laber about Ingres VectorWise. Together with Ingres, we had shown the integration of RapidAnalytics and Ingres VectorWise in one of the workshops before (amazing: we reduced the time for the in-database induction of a decision tree from 9 minutes to 15 seconds just by using VectorWise). In his talk, Olaf presented some very interesting details of VectorWise and - as always when people hear about this fascinating technology for the first time - he quickly got a lot of discussion about the benefits of Ingres VectorWise.

We had an amazing program and we got a lot of positive comments for organizing this event. It was a lot of work but the day really was fun and I met so many wonderful people and exchanged a lot of ideas. Finally, I would also like to thank our partners Actuate, Ancud IT, Ingres, Jaspersoft, Jedox, Talend, and viadee for their participation and support!

 Data Mining 7 Dec 2010 Data Mining Map by Ingo Mierswa Comment (0)

Recently we had a discussion about good resources for data mining beginners in our forum. Well, there are a lot of books out there and I am not going to repeat the recommendations from the forum thread here.

However, I would like to add another resource which is quite helpful to understand many important concepts of data mining and how they relate to each other. Check out the data mining map of Dr. Sayad of the University of Toronto:

Of course, the texts behind the map are not complete in the sense that you would need no other resource or that every topic is covered. But it is fun to browse through the concepts and delve deeper and deeper until you finally reach the more sophisticated algorithms.

Check out this nice training resource and have fun!

 Trading, R, Process, Example 6 Dec 2010 RapidMiner and R for Trading Part II: Genetic Optimization by Ingo Mierswa Comment (0)
A couple of weeks ago, the author Neural Concepts posted a description of a very interesting application for RapidMiner and its new R Extension. In his blog A Physicist in Wall Street, he described a complete trading system based on this combination.

Our blog post about this financial data mining application quickly became one of the most read articles on this blog, so I am sure that many of you will be really happy to see that Neural Concepts has improved his processes by means of genetic optimization schemes and now gets much better results. The following picture shows the return over time:

The goal is still the prediction of the next day's close price in order to generate buying and selling signals. But the approach has now been improved a lot and is described in detail in a nice video. Check out the original blog post for more details:

And here is the video showing all steps in detail:

 Math, Fun 3 Dec 2010 Fun Math Trick: Squaring Numbers Close to 100 by Ingo Mierswa Comment (1)

I just stumbled upon this nice little trick which helps you to square numbers which are close to 100.

Let's say you want to calculate 105*105. Take the difference between 105 and 100, which is 5, add it to 105, and you get 110, which gives the first three digits. Then append 5 squared as the last two digits (that is, 11000 + 25) and you get 11025, which is the result. This works for all numbers up to 150 but is more useful if the number is close to 100. By the way, it also works for numbers smaller than 100, but in that case you have to subtract the difference to 100 from your number instead.
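
Why this works can be seen from a short derivation. Writing the number as n = 100 + d is my own notation for illustration (d is negative for numbers below 100); the identity itself is just the binomial formula.

```latex
\begin{align*}
n^2 &= (100 + d)^2 = 10000 + 200\,d + d^2 \\
    &= 100\,(100 + 2d) + d^2 \\
    &= 100\,(n + d) + d^2 \\[4pt]
105^2 &= 100 \cdot (105 + 5) + 5^2 = 11000 + 25 = 11025 \\
95^2  &= 100 \cdot (95 - 5) + (-5)^2 = 9000 + 25 = 9025 \\
112^2 &= 100 \cdot (112 + 12) + 12^2 = 12400 + 144 = 12544
\end{align*}
```

The last line shows what happens for larger distances: once d squared reaches 100 or more, it no longer fits into the last two digits and has to be added rather than simply appended.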

The following video gives you some more examples and also shows what happens for larger distances:

Neat!