Open source software for big data analytics. No programming required.
Rapid-I Blog
 Plotter 2 Dec 2010 The RapidMiner Plotters 9: Series by Ingo Mierswa Comment (0)

Overview "Series"

• Summary: Natural representation for ordered values, such as measurements changing over time
• Number of Dimensions: unlimited in theory, but useful up to a couple of dozen
• Data Types: Numerical, Dates
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Series plotter, we will first have a look:

The series plot is the basic representation for series values, i.e. for cases where the values are ordered along a known or unknown dimension. It is most often used for time series plots, where the x-axis represents the time and the values of the series are plotted against the value range shown on the y-axis.

The series values have to be encoded in the attributes / columns of the data table. This means that each series is stored in a single column of the table. In the example above, we have selected three series in the Plot Series selection list on the left. The trajectories of those series are now plotted and each series gets its own color. Those colors are explained in a legend at the top of the plot. As you can see, the values are quite similar for the three series. They only differ a bit more between, for example, 30 and 40 on the x-axis.

Sometimes, you want to display a value range which is defined by two series (like one series for the possible minimum values at each point in time and one for the corresponding maximum values). It is then nice to show one or several additional series and be able to compare them against each other or against the depicted range. This could look like the following plot:

As you can see, we have selected one series for the lower bound and one for the upper bound. The range in between is painted in a transparent grey and the mean value is depicted by a dashed line. This grey area is named bounds in the legend at the top. We have created two new columns named Min and Max with the operator Generate Aggregation in advance and selected those two attributes for the bounds.
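The post does not show how Generate Aggregation was configured, so the following is only a rough Python analogy of the idea, not the operator itself: for every row, take the minimum and maximum across the series columns and store them as two new columns. The table contents and the helper name add_bounds are hypothetical.

```python
# Hypothetical table: each key is one series (one column), each list position is one row.
table = {
    "series_a": [1.0, 2.0, 4.0, 3.0],
    "series_b": [2.0, 1.5, 5.0, 2.0],
    "series_c": [0.5, 2.5, 4.5, 4.0],
}

def add_bounds(columns):
    """Return a copy of the table with row-wise "Min" and "Max" columns added."""
    rows = list(zip(*columns.values()))      # one tuple of series values per row
    result = dict(columns)                   # shallow copy, the input stays untouched
    result["Min"] = [min(row) for row in rows]
    result["Max"] = [max(row) for row in rows]
    return result

with_bounds = add_bounds(table)
print(with_bounds["Min"])  # lower edge of the band
print(with_bounds["Max"])  # upper edge of the band
```

The two new columns play the role of the lower and upper bound selected in the plotter; the original series stay available for plotting against the band.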

Another setting which is often used is the index dimension. With this setting, you can select one of the dimensions to define the range on the x-axis. If your data set contains, for example, a date column specifying on which date and / or time the measurements of the other series (the other columns) were taken, you probably want to define the date as index dimension. The following picture shows what this looks like:

As you can easily see, the year is now written on the x-axis. We have data spanning about 10 years here. Of course, you can also zoom in and pan (move) in the zoomed plot. This is described in the first part of our plotter series. If you zoom in, the index dimension is updated accordingly, as can be seen in the following picture where we concentrate on one and a half years only:

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 RapidAnalytics, OSBI, Business Intelligence, Business Analytics 1 Dec 2010 Final Reminder for OSBI 2010 by Ingo Mierswa Comment (0)

Ok, I suppose most of our frequent readers are well aware that we will have another big event coming soon (after our really successful first community meeting and RapidMiner conference RCOMM in September of this year, here is part 2 of the report).

The next big event is called OSBI 2010, which stands for Open Source Business Intelligence Day 2010. Last year, the OSBI was a great success. Together with our partners Ancud IT, Avantgarde Labs, Ingres, Jaspersoft, Talend, and viadee, we offered an impressive program for about 100 participants and received very positive feedback.

This year, the partner lineup is almost the same, but we also got our new partners on board: Actuate, who created the BI solution Birt, and Jedox with their solution PALO. This led to an equally impressive program which can be checked out here:

As you have seen yesterday, Rapid-I also has a surprise to offer for the OSBI 2010: we will demonstrate for the very first time the complete business analytics solution called RapidAnalytics. And we will also show its combination with the exciting new database Ingres VectorWise which really accelerates data access and analysis.

Instead, take your chance and get one of the last available tickets for the OSBI 2010 and be there when we, our partners, and our customers and users demonstrate the latest results and possibilities for open source business intelligence applications!

Other posts about the OSBI 2010:

 RapidAnalytics, Business Analytics 30 Nov 2010 RapidAnalytics released at OSBI 2010 by Ingo Mierswa Comment (0)

Finally, Rapid-I will release RapidAnalytics to the general public - until now we have used RapidAnalytics in pilot customer projects only. We are happy to invite you to the public release of RapidAnalytics during the second Open Source Business Intelligence Day (OSBI 2010).

Everybody knows business intelligence (BI). In traditional BI, we focus on using a consistent set of metrics to measure past performance and present those measurements to users in order to support them in business planning. Presenting past data of course is only the minimum level necessary for informed decision making and business planning.

There are, however, two drawbacks with traditional BI: the retrospective usage of data alone does not deliver insights about expected outcomes, and there is hardly any connection between business intelligence and the underlying business processes. In other words: there is no feedback loop and hence also no real-time integration of analysis results into the business processes themselves.

RapidAnalytics is coming to your rescue: it is actually the first open source business analytics solution available. It covers the complete flow from Analytical ETL to Predictive Reporting in a server-based solution built around RapidMiner, the world-leading open source data mining solution. And the process-oriented approach of RapidMiner and RapidAnalytics allows the direct and even real-time integration into business processes.

The following video demonstrates some of the features of RapidAnalytics. It offers features like remote execution, scheduled processes, quick web service definitions, and a complete web-based report designer:

A visit to the OSBI 2010 is recommended for all interested users who want to get a hands-on demonstration of RapidAnalytics. But there is much more: many interesting talks and workshops by our partners Actuate, Ancud, Ingres, Jaspersoft, Jedox, Talend, and Viadee will cover all relevant aspects of open source business intelligence.

More information and the online registration form for the OSBI 2010 can be found at

http://www.osbi2010.de

News article announcing release of RapidAnalytics at OSBI 2010:

The RapidAnalytics web page can be found at

 RapidMiner 26 Nov 2010 Short Review: RapidMiner one of the Top Data Mining Systems by Ingo Mierswa Comment (0)

I am of course always happy when people appreciate our work and write something nice about RapidMiner or our other solutions or work. I want to share a short review and recommendation about Top Business Data Mining Systems written by Rob of IT Performs:

Rob has recommended RapidMiner as a data mining solution and clearly identified the major reasons for using RapidMiner:

"With analytical ETL, reporting, data analysis and integration, along with a really nice GUI, it makes a powerful combination."

and

"The hundreds of modifications, extras and updates are one of the main benefits of this system, thanks to the open source nature of the programming."

I fully agree and want to take the chance to thank all of you who have extended or optimized RapidMiner and shared the results with us and all community members!

 Plotter 25 Nov 2010 The RapidMiner Plotters 8: Deviation by Ingo Mierswa Comment (0)

Overview "Deviation"

• Summary: Great for large numbers of dimensions and also high numbers of examples
• Number of Dimensions: unlimited plus 1 as color
• Data Types: Numerical, Nominal, Dates
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Deviation plotter, we will first have a look:

The deviation plot is a high-dimensional visualization technique for almost arbitrary numbers of dimensions. It is very similar to the Parallel plot discussed in the last session. But in contrast to the parallel plot, the deviation plot does not show each data point as a line in an n-dimensional space but only one line for a group of examples.

As in the parallel plot, each dimension is displayed as a vertical grid line which is parallel to the other dimension grid lines. Now, the average and the standard deviation for each attribute are calculated, and all average values are then represented as a line consisting of multiple line segments. This average line (printed in bold) intersects each dimension line at the position corresponding to the average value of that attribute. In addition to the average values, a transparent region is drawn around the bold average line indicating the range of the standard deviations for all attributes. You can now easily see the mean value together with the range in which most examples are usually located - and this for a quite large number of attributes.
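The calculation behind the bold line and the transparent region can be sketched in a few lines of Python. This is only an illustration with made-up data; in particular, the post does not say whether RapidMiner uses the population or the sample standard deviation, so the use of pstdev here is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical data: one row per example, one column per attribute.
examples = [
    [1.0, 10.0, 0.2],
    [2.0, 14.0, 0.4],
    [3.0, 12.0, 0.9],
]

def deviation_profile(rows):
    """Return one (mean, standard deviation) pair per attribute."""
    columns = list(zip(*rows))               # transpose rows into attribute columns
    # Assumption: population standard deviation; the plotter may use the sample version.
    return [(mean(col), pstdev(col)) for col in columns]

profile = deviation_profile(examples)
for axis, (avg, std) in enumerate(profile):
    # The bold line passes through avg, the transparent band spans avg - std to avg + std.
    print(f"attribute {axis}: mean={avg:.3f}, band=[{avg - std:.3f}, {avg + std:.3f}]")
```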

The deviation plot becomes even more powerful if you select a nominal attribute to be used as the Color of the lines. This definition basically works like a grouping of the examples before the average values and standard deviations are calculated. The lines and transparent regions are then drawn for each of the possible nominal values of the attribute selected for the color. Those colors are explained in a legend at the top of the plot. This becomes quite handy if you select a class label for the color, since in many cases this gives you first hints about which attributes are well suited for classification. Those attributes differ more in the average values (the bold lines) and have less overlapping transparent regions. Here is a plot of the same data set as the one above (it is called Sonar and is delivered as sample data together with RapidMiner) but now we use the class label for the color:
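The grouping step and the "less overlapping regions" reading can be sketched as follows. The labels "A" and "B", the values, and the helper names are hypothetical, and the overlap test (mean plus or minus one population standard deviation) is an assumption about how one might quantify what the plot shows visually.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical labeled examples: (class label, attribute values).
labeled = [
    ("A", [0.1, 0.8]),
    ("A", [0.3, 0.6]),
    ("B", [0.7, 0.7]),
    ("B", [0.9, 0.5]),
]

def grouped_profile(rows):
    """Group examples by label, then compute (mean, std) per attribute and group."""
    groups = defaultdict(list)
    for label, values in rows:
        groups[label].append(values)
    return {
        label: [(mean(col), pstdev(col)) for col in zip(*members)]
        for label, members in groups.items()
    }

def bands_overlap(stats_a, stats_b):
    """True if the two mean +/- std intervals share at least one point."""
    (mean_a, std_a), (mean_b, std_b) = stats_a, stats_b
    return mean_a - std_a <= mean_b + std_b and mean_b - std_b <= mean_a + std_a

profiles = grouped_profile(labeled)
overlaps = [bands_overlap(a, b) for a, b in zip(profiles["A"], profiles["B"])]
print(overlaps)  # attributes with False are candidates for separating the classes
```

In this toy data the bands of the first attribute do not overlap while those of the second do, which mirrors the visual hint described above.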

It is very easy to see that there are two or three attribute regions which might be more helpful to distinguish between the two classes. For exactly this reason, the deviation plot often offers insights where the traditional parallel plot is not able to show anything simply because of the high number of examples.

The deviation plot also has an additional parameter called Local Normalization which can be used to rescale the values in all dimensions on a range between 0 and 1 (normalization). This is quite helpful if the values differ a lot between the dimensions.
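Rescaling every dimension to the range between 0 and 1 is commonly done with min-max scaling. The sketch below assumes that this is what Local Normalization does per dimension; the post does not give the exact formula, and the handling of constant columns is a choice made here for illustration.

```python
def normalize_columns(rows):
    """Rescale every column independently to the range [0, 1] (min-max scaling)."""
    scaled_columns = []
    for col in zip(*rows):                   # iterate over attribute columns
        low, high = min(col), max(col)
        span = high - low
        # Assumption: a constant column has no spread and is mapped to 0.0.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical data with very different value ranges per dimension.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
normalized = normalize_columns(rows)
print(normalized)
```

After scaling, both dimensions use the full height of their axis, so a dimension with small absolute values is no longer squeezed flat by one with large values.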

The last two options are pretty simple and do the same as for the other plotters: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 RapidMiner, Preview, Date 21 Nov 2010 Preview: New Date Functions for Attribute Generation by Ingo Mierswa Comment (0)

Recently we improved the creation of new attributes a lot. The operator "Generate Attributes" is the basis for those calculations. Here, the analyst can define a list of expressions which are evaluated, and new attributes are calculated based on the values of already existing attributes. This makes the "Generate Attributes" operator one of the most important operators for data preprocessing.

We already have discussed two additions which will be published with the next major release 5.1 of RapidMiner:

Today we are happy to announce a third addition for the "Generate Attributes" operator, namely the ability to deal with dates and perform calendar operations.

Let's have a look at the supported functions:

In the example above, the difference between the current time, obtained with date_now(), and the date stored in the column named "Datum" is calculated with the function date_diff() and parsed. The result can then be processed further, for example with the operator "Date to Numerical", which would extract the number of days of this difference.

The date functions supported by the operator "Generate Attributes" are:

• date_parse(): Parses a date given as string or as number of milliseconds
• date_parse_loc(): Same as date_parse() but using a specified locale
• date_parse_custom(): Same as date_parse() but using a specified format
• date_before(): Compares two dates and returns true if the first date is before the second
• date_after():  Compares two dates and returns true if the first date is after the second
• date_str(): Transforms a date to a string representation
• date_str_loc(): Same as date_str() but using a specified locale
• date_str_custom(): Same as date_str() but using a specified format
• date_now(): Creates the current date and time
• date_diff(): Calculates the difference between two dates
• date_add(): Adds a specified amount of time to the given date
• date_set(): Sets a specific part of the given date
• date_get(): Delivers a specific part of the given date
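For readers who know Python better than the RapidMiner expression language, the listed functions have rough counterparts in the standard datetime module. The sketch below is only an analogy: it is not RapidMiner syntax, the argument order and format strings of the RapidMiner functions are not given in this post, and the dates are made up.

```python
from datetime import datetime, timedelta

# Hypothetical value of the "Datum" column for one example.
datum = datetime(2010, 11, 1, 9, 30)
# Fixed stand-in for date_now() so that the output is reproducible.
now = datetime(2010, 11, 21, 9, 30)

diff = now - datum                                  # similar in spirit to date_diff()
print(diff.days)                                    # difference expressed in days
print(datum < now)                                  # similar to date_before()
print(datum + timedelta(days=7))                    # similar to date_add()
print(datum.replace(month=12))                      # similar to date_set()
print(datum.year)                                   # similar to date_get()
print(datum.strftime("%Y-%m-%d"))                   # similar to date_str_custom()
print(datetime.strptime("2010-11-01", "%Y-%m-%d"))  # similar to date_parse_custom()
```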

Together with the already existing operators "Date to Numerical", "Date to Nominal", "Numerical to Date", "Nominal to Date", and "Adjust Date", these new date functions for the attribute generation build a powerful base for all types of date transformations. Have fun and stay tuned - the next version RapidMiner 5.1 will be released soon!

 Trading, Process, Example 19 Nov 2010 RapidMiner and R for Trading by Ingo Mierswa Comment (3)

Hi,

one of our forum members has posted a link to a very interesting application for RapidMiner and its new R Extension. In the blog A Physicist in Wall Street, the author Neural Concepts describes a complete trading system based on this combination.

In order to check the power of RapidMiner + R, the author built a simple example using an algorithm based on a support vector machine for predicting the next day's price and generated buy and sell signals based on this prediction. Check out the original blog post for more details:

The requirements needed to build the model are, of course, RapidMiner, the Weka Extension, the Time Series Extension and the R Extension. This requires installing R with quantmod, TTR and PerformanceAnalytics packages.
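The original post is not reproduced here, so the exact signal rule is unknown. As a minimal sketch of how buy and sell signals could be derived from a next-day price prediction, assume the simplest possible rule: buy when the predicted price is above today's price, sell when it is below. The function name, the rule, and the numbers are all hypothetical, and the predictions stand in for the output of whatever model is used.

```python
def trading_signals(prices, predicted_next):
    """Compare each day's price with the model's prediction for the next day.

    Assumed rule (not taken from the original post): "buy" if the predicted
    price is higher than today's price, "sell" if lower, otherwise "hold".
    """
    signals = []
    for today, predicted in zip(prices, predicted_next):
        if predicted > today:
            signals.append("buy")
        elif predicted < today:
            signals.append("sell")
        else:
            signals.append("hold")
    return signals

# Hypothetical closing prices and hypothetical model predictions for the following day.
closing = [10.0, 11.0, 12.0]
predicted = [10.5, 10.8, 12.0]
print(trading_signals(closing, predicted))
```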

The greatest thing about his post is the fact that every single step is described in great detail. It seems that the results can be easily reproduced - which, I have to admit, I did not try myself yet. Another nice option: you can donate a small amount, which allows you to download the processes etc. directly.

 Open Source, Data Mining 17 Nov 2010 Be cautious about open source data mining by Ingo Mierswa Comment (0)

Yesterday I stumbled upon an article called "Be cautious about open source data mining" written by Anh Nguyen about a talk given by Jos van Dongen at the Predictive Analytics World in London. My initial thought was something like "ok, the author is probably just a partner of some proprietary software vendor living well off the sales commissions for the sold licenses".

Hence, I did not expect anything neutral and objective but a completely proprietary-vendor-X-oriented article describing with the greatest eloquence why proprietary solution X is so much better than any open source solution. Things like: Those open source solutions are free. They simply cannot work - for exactly this reason. And they are of course a danger not only for the complete IT infrastructure but also for the analyst's mind and of course for the whole enterprise. Which is by the way very likely to break down simply by introducing something they did not pay millions in license fees for. I have actually read enough articles like that before and initially I did not want to give this one another chance.

Since I had to wait for another couple of minutes before a meeting started, I clicked on the link and was deeply surprised. There was a set of theses which were completely reasonable. I liked those and hence I want to comment on them and extend them a bit:

"It's free but should be evaluated like any other software"

This is actually nothing new and I fully agree. Of course I like what we are doing here at Rapid-I and personally I think RapidMiner / RapidAnalytics are among the best solutions for almost every aspect of data analysis you can think of. Nevertheless, there are situations where other solutions might be more appropriate. At least there is a chance for this, so you should give all options a try. What did you just say? This is not easy since not all options are delivered as open source solutions? Right. But that's hardly our fault...

"It doesn’t matter if the software is free if it takes longer to build, manage and deploy solutions to end users, or if it is unstable, or missing key features. Don’t select just because it is open source”

Again I fully agree. Choosing a solution simply because it is an open source solution is probably as stupid as avoiding it for exactly that reason. Regarding the potential drawbacks connected to maintaining the software or to software quality, I would like to add that this is exactly why successful commercial open source companies like Rapid-I offer their Enterprise Editions. Those editions help to overcome those software issues by providing stabilized releases, higher levels of quality assurance, and full support. If you want a fair comparison, you should go for the now-no-longer-free Enterprise Editions and compare those against proprietary solutions. By the way: from my experience, maintaining software or worrying about missing features feels exactly the same for open and closed source products. The difference lies not in the software per se but in the service quality of the companies behind it.

"van Dongen believes that if a business does not have any existing tools for data mining, they should make open source the default option. "

This is the strongest claim and I want to support it. The essence here is: if there already is a software solution for data mining, I think the optimal way is not to rip it out of your infrastructure and directly and completely replace it with an open source solution. Think gradually and employ RapidMiner for the next project before stocking up your licenses for the other solution. Or make it the default if you don't have a solution at all and have to get used to a new solution for data mining or business analytics anyway. We have experienced all three ways during the last years: moving gradually from a closed-source solution to RapidMiner from project to project, starting with RapidMiner as primary data mining solution right away, and directly replacing the old solution with RapidMiner at once. I must say that the last way was the hardest option for all people involved in those projects. But again, this is not specific to open source; it applies to replacing or migrating between different types of software in general.

Oh, and by the way: another thing I really liked is that van Dongen and Anh Nguyen recommended RapidMiner as an open source solution for data mining. That made me like this article even more than I already did :-)

Here is a PDF file containing the article in case it has been removed from the web.

 Videos, Text Mining 16 Nov 2010 Great Video Series about Text Mining by Ingo Mierswa Comment (0)

Hence, he posted a total of five videos of about 10 minutes each.

Neil did a great job and produced the videos along a sample application based on a popular job posting board. The five videos cover the following topics:

Thanks again Neil for this amazing video series! And please check out his blog and his other posts including also other videos about using RapidMiner.

 Plotter 15 Nov 2010 The RapidMiner Plotters 7: Parallel by Ingo Mierswa Comment (0)

Overview "Parallel"

• Summary: Great for large numbers of dimensions and a small number of examples
• Number of Dimensions: unlimited plus 1 as color
• Data Types: Numerical, Nominal, Dates
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Parallel plotter, we will first have a look:

The parallel plot is a high-dimensional visualization technique for almost arbitrary numbers of dimensions. It shows the set of data points as lines in an n-dimensional space, one line for each example. Each dimension is displayed as a vertical grid line which is parallel to the other dimension grid lines (hence the name). Each data point is then represented as a line consisting of multiple line segments. Those example lines intersect each dimension line at the position corresponding to the value of the point for this attribute. This makes the parallel plot well suited for cases where you have many different attributes and not too many examples, as you can see above.
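The geometry described above is simple enough to write down directly: every example becomes a polyline whose vertices are (axis index, attribute value) pairs, one vertex per dimension. The sketch below only computes those vertices and draws nothing; the values and the helper name are hypothetical.

```python
def to_polylines(rows):
    """Turn each example into a list of (axis index, value) vertices."""
    return [[(axis, value) for axis, value in enumerate(row)] for row in rows]

# Hypothetical examples with four attributes each.
examples = [
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]

for line in to_polylines(examples):
    # Consecutive vertices are joined by straight segments between neighboring axes.
    print(line)
```

Because every example adds one full polyline, the picture gets crowded quickly, which is why the plot works best with a modest number of examples.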

The parallel plot also is the natural representation of ordered attributes like for time series data in cases where the series is encoded by the attributes instead of the examples. In the latter case you would use a specialized series plotter which will be covered later in this series.

One dimension can be defined to be used as the Color of the lines. If you select an attribute of your data or model for the color, the values of this attribute will be used for determining the color of each of the data points. Those colors are explained in a legend at the top of the plot. It does not matter if the selected column is numerical or nominal or dates, all scenarios will work. All other dimensions can also be of those types.

In many cases, a parallel plot already shows a lot of dependencies between the attributes and sometimes it even shows which attributes are probably important for classification tasks. Have a look at the plot of the famous Iris data set:

It can easily be seen that the attributes a1 and a2 are less helpful for distinguishing between the three classes of flowers. The values of the data points in the dimensions a3 and a4, however, are quite different for the classes and allow already for a visual separation which can easily be seen.

The parallel plot has an additional parameter called Local Normalization which can be used to rescale the values in all dimensions on a range between 0 and 1 (normalization). This is quite helpful if the values differ a lot between the dimensions.

The last two options are pretty simple and do the same as for the other plotters: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series: