Open source software for big data analytics.
No programming required.

Rapid-I Blog
Talks, Conferences 8 Nov 2010
Talks, Talks, Talks by Ingo Mierswa Comment (0)

During the next couple of weeks, there are several opportunities for learning more about data analysis with RapidMiner and RapidAnalytics as well as about other open source solutions and their business applications. I would love to meet you at one of the following events:

 

OSBI 2010


December 8th, 2010 at Hambacher Schloss, Neustadt, Germany

Rapid-I invites you to the second Open Source Business Intelligence Day (OSBI 2010). Last year, about 100 visitors participated in this unique event and we are happy to again offer an impressive program. Besides Ancud IT, Ingres, Jaspersoft, Talend, and Viadee, Actuate (BIRT) and Jedox (Palo) will also participate, both for the first time.

Learn more about all leading Open Source BI solutions during a single day!

 

SOSOCON 2010


December 1st, 2010 at Messe Hannover, Germany

I will give a talk together with Olaf Laber of Ingres about new challenges related to extremely large data sets and the current state of the RapidMiner / RapidAnalytics integration with Ingres VectorWise. I am sure that many of you already know VectorWise - at least the visitors of the RCOMM 2010 already got a first impression of how fast and scalable data analysis becomes with this combination.

Learn more about speeding up your data analyses with RapidMiner + Ingres VectorWise!

 

SwissICT


 November 22, 2010 at Technopark Zürich

Another talk given by me and Dr. Dimitre Leonidov, Account Manager at Ingres, which shows how Data Mining 2.0 can be performed on extremely large data sets based on the combination of RapidMiner and Ingres VectorWise.

Learn more about the combination of RapidMiner / RapidAnalytics with Ingres VectorWise!

 

ISI 2010


November 11, 2010 in Dolenjske Toplice, Slovenia

Last but not least, I am giving a keynote speech called Data Mining for the Masses: Supporting Non-Analysts in Analysis Process Design at the ISIT 2010 on November 11 in Dolenjske Toplice, Slovenia. In this talk I show how easy data mining can be today and that data analysis can actually be ubiquitous.

Learn more about my keynote speech and the conference!

 

Looking forward to meeting you there!

Insights 5 Nov 2010
Market Perception for Open Source vs. Closed Source by Ingo Mierswa Comment (0)

Google Insights is a fun tool showing you how many search requests have been made for the specified keywords during the last years. Of course, sometimes what you see is not really a trend but simply a replacement of names or concepts; nevertheless, you can sometimes also get some interesting insights.

A couple of weeks ago I was asked during an interview if proprietary vendors of data mining solutions like SAS or SPSS already notice a drop in sales due to the pressure caused by the low license costs offered by our open source solutions.

Well, I cannot really answer this, of course, but I find it interesting enough to simply compare awareness and market perception for RapidMiner as an open source solution with, let's say, SAS Enterprise Miner. And that's exactly what I did:

 

 

No question that those trends are neither accurate nor representative but hey, there seems to be at least some movement in the market. These are exciting times and I am eager to see how the IT world will change over the next few years.

RapidMiner, Plotter 3 Nov 2010
The RapidMiner Plotters 6: Bubble by Ingo Mierswa Comment (1)

Overview "Bubble"

  • Summary: Showing dependencies between four dimensions on a 2D plane
  • Number of Dimensions: 2 plus 1 as size and 1 as color
  • Data Types: Numerical, Nominal, Dates
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Bubble plotter, we will again first have a look:

 

 

The bubble plot is a simple two-dimensional plot with two axes: x and y. If you plot a data set, each point will be located at the position which corresponds to its values with respect to those two axes. In this sense, the bubble plot is very similar to the traditional scatter plot. In addition to what the scatter plotter offers, the bubble chart can define two extra dimensions: one is displayed by the size of the bubbles (hence the name) and one can be used as the Color.

If you select an attribute of your data or model for the color, the values of this attribute will be used for determining the color of each of the data points. It does not matter whether the selected column is numerical, nominal, or a date; all of these will work.

In all cases you will get a legend at the top which uses different colors for each possible value. The other dimensions, namely x, y, and the size can also be of those types.

The options for the dimensions are exactly the same as for the scatter plot and are described here only for the sake of completeness.

As you can see, you can also specify whether the selected attribute should be transformed to a log scale. Just check the box below the corresponding axis.

The last two options are pretty simple: Rotate Labels rotates the labels of the x-axis by 90 degrees. Especially if you use a nominal attribute for the x-axis, the values can then be read easily. Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.

 

Zooming and Panning

In general, you can zoom into a RapidMiner plotter by dragging a zooming rectangle indicating which part of the plot should be drawn at a larger scale.

In order to zoom in, drag the rectangle from the top left to the bottom right. In order to zoom out, simply drag a rectangle in the opposite direction, i.e. from the bottom right to somewhere in the upper left. Simply try this and you will quickly get used to it.

If you have zoomed in, you probably want to move around in order to look at other parts of the plot. This movement within a plot is called panning and can be done by holding down the CTRL key while dragging the mouse. The plot will then be moved in the direction in which you dragged the mouse.

 

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 

Social Networks, Analytics 2 Nov 2010
Next Generation and Social Analytics among Gartner Top 10 for 2011 by Ingo Mierswa Comment (2)

Gartner, Inc. highlighted the top 10 technologies and trends that will be strategic for most organizations in 2011. Those technologies are assumed to have the potential for significant impact on the enterprise in the next three years. Factors that denote significant impact include a high potential for disruption to IT or the business, the need for a major dollar investment, or the risk of being late to adopt.

Among the selected technologies, two of them are highly related to data mining:

Next Generation Analytics. "[...] It is becoming possible to run simulations or models to predict the future outcome, rather than to simply provide backward looking data about past interactions, and to do these predictions in real-time to support each individual business action. [...]"

Social Analytics. "Social analytics describes the process of measuring, analyzing and interpreting the results of interactions and associations among people, topics and ideas. [...] Social network analysis involves collecting data from multiple sources, identifying relationships, and evaluating the impact, quality or effectiveness of a relationship."

I must say that I agree with most of the selected technologies and I am particularly happy that with Next Generation Analytics and Social Analytics two areas were selected which are strongly covered by Rapid-I and our products!

I also find it quite interesting that ubiquitous computing finally is a part of this list. Personally, I follow the thought of my PhD supervisor Katharina Morik who strongly believes that the combination of data analysis with ubiquitous computing will become one of the next hot topics for analysts.

Here is a good comparison of 2010 and 2011 top technologies from 10 things: Handicapping Gartner's top technologies for 2011, by Larry Dignan, ZDnet:

Fun 28 Oct 2010
Juggling in a Cone by Ingo Mierswa Comment (0)

This video is just for fun, for those readers who are also interested in the mathematical and statistical background of data mining. It shows Greg Kennedy standing in an 8-foot high inverted cone. He starts juggling 3, 5, and 7 balls on the inside surface and makes great use of the principles of geometry and physics:

 

 

Greg is well known worldwide not only for traditional juggling but also for creating entirely new forms of manipulation. Visit www.innovativejuggler.com for more info.

SVM, RapidMiner 27 Oct 2010
X-Validation with One-Class SVM (myexperiment) by Ingo Mierswa Comment (0)

Marco Stolpe recently published a nice process on the myExperiment platform showing how you can calculate the performance of a one-class SVM.

 

 Process

 

RapidMiner includes the One-Class SVM as a part of the LibSVM operator. The basic idea is to partition the data set as usual and to train the one class classifier only on one class inside the cross validation, but to test it on both classes for the part that was left out for testing. So for training, we have to remove all examples with the label we don't want to train the classifier on (Filter). As the SVM operator expects the nominal label values to consist of only one value, we need to map the nominal value "Mine" to "Rock". When we apply the one class model to our test data, we get "inside/outside" as a prediction. These values have to be mapped back to the original corresponding nominal values "Rock" and "Mine". Afterwards, we can use the standard performance operator.
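The evaluation scheme described above can be sketched outside of RapidMiner as follows. This is only an illustration of the protocol (filter the training folds to one class, test on both classes, map "inside/outside" back to the label names). To keep it self-contained, it assumes a very simple stand-in one-class model based on the distance to the training centroid, not the LibSVM one-class SVM, and the data points, class name `CentroidOneClass`, and function name `one_class_cross_validation` are hypothetical.

```python
import math


class CentroidOneClass:
    """Stand-in one-class model (NOT an SVM): a point counts as 'inside'
    if it is no farther from the training centroid than the farthest
    training point."""

    def fit(self, points):
        dims = len(points[0])
        self.center = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        self.radius = max(math.dist(p, self.center) for p in points)
        return self

    def predict(self, point):
        return "inside" if math.dist(point, self.center) <= self.radius else "outside"


def one_class_cross_validation(examples, target="Rock", other="Mine", folds=3):
    """examples: list of (features, label) pairs. Returns the overall accuracy."""
    to_label = {"inside": target, "outside": other}
    correct = 0
    for k in range(folds):
        test = examples[k::folds]
        train = [ex for i, ex in enumerate(examples) if i % folds != k]
        # Filter step: train only on examples of the target class.
        train_points = [x for x, label in train if label == target]
        model = CentroidOneClass().fit(train_points)
        # Test on both classes and map inside/outside back to the label names.
        for x, label in test:
            if to_label[model.predict(x)] == label:
                correct += 1
    return correct / len(examples)


# Hypothetical toy data: two well separated groups, interleaved by class.
rocks = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8)]
mines = [(5, 5), (6, 5), (5, 6), (6, 6), (5.5, 5.5), (5.2, 5.8)]
examples = []
for rock, mine in zip(rocks, mines):
    examples.append((rock, "Rock"))
    examples.append((mine, "Mine"))

accuracy = one_class_cross_validation(examples)
print(f"accuracy: {accuracy:.3f}")
```

Note that the model never sees a "Mine" example during training. Any misclassification therefore comes from "Rock" examples of the test fold that fall outside the region learned from the training folds.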

Nice idea! Please note, however, that the data set (Sonar) is just a toy data set chosen for demonstration. More information can be found on the process web page at myExperiment.

The complete process can be downloaded with our Community Extension. The name of the process is "X-Validation with One-Class SVM".

RapidMiner, Plotter 26 Oct 2010
The RapidMiner Plotters 5: Scatter 3D Color by Ingo Mierswa Comment (0)

Overview "Scatter 3D Color"

  • Summary: Showing dependencies between four dimensions
  • Number of Dimensions: 3 plus 1 as color
  • Data Types: Numerical (dates and nominal values will be printed by their internal values), the color dimension can also be Nominal
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first part of the series before reading this one.

Before we start our discussion about the Scatter 3D Color plotter, we will again first have a look:

 

Scatter 3D Color Plotter

 

The scatter 3D color plot is very similar to the usual 3D scatter plotter discussed before. It is a simple three-dimensional plot with three axes: x, y, and z. If you plot a data set, each point will be located at the position which corresponds to the values with respect to those three axes. In addition to the non-color plotter, you can now define one extra dimension as Color instead of having multiple z-dimensions.

 The color dimension supports numerical as well as nominal values. In both cases you will get a legend at the top which uses different colors for each possible value.

The other options are exactly the same as for the non-color 3D scatter plot and are described here only for the sake of completeness.

The 3D plots are somewhat special in terms of additional options. In general, you can zoom into RapidMiner plotters by dragging a zooming rectangle, and you can pan, i.e. move around in a plot, by holding down the CTRL key while dragging the mouse. For the 3D plots in RapidMiner, however, you have three different modes: Panning, Zooming, and Rotating.

Those different modes are indicated by the icons in the lower part of the options. If you select the first mode, indicated by a lens with arrows, you will change into the Panning mode. Now you can drag the mouse and this will move the complete plot across the screen.

The second icon indicates the Zooming mode, where you can select a zooming region by dragging a rectangle or you can simply use the mouse wheel for zooming in and out.

The third icon indicates the Rotation mode, which is the default. Dragging the mouse will now not zoom or pan the plot but rotate it. Simply try it and you will quickly get used to it.

If you want to go back to the default view, the fourth icon will help: this one will reset all views and axes to the initial default. And there is one last option, called Set Scales, which brings up a dialog that allows you to rename the axes, apply logarithmic scales, or adapt the ranges for the axes. Please refer to the non-color 3D plotter for additional information and screenshots.

I hope that you like this series of plotter explanations. Please let me know what you think and if we should continue with this!

Other parts of the plotter series:

 

Regular Expressions 22 Oct 2010
Regular Expression Madness by Ingo Mierswa Comment (1)

I just stumbled upon this great blog post about some uncommon uses of regular expressions. RapidMiner also makes a lot of use of those beasts, especially for the definition of filters, so I thought this post might be interesting to you.

Both examples are taken from the book The Unix Programming Environment by Kernighan and Pike (1984).

The first problem is to produce a list of all English words that contain all five vowels exactly once and in alphabetical order.

The book creates a regular expression

^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$

then uses it to filter a dictionary file. This produced 16 words ranging from abstemious to majestious.
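As a small illustration, the same expression can be applied with Python's `re` module. The word list below is a hypothetical sample standing in for the dictionary file, and the helper name `vowels_once_in_order` is made up for this sketch.

```python
import re

# The expression quoted above: runs of non-vowels separated by
# a, e, i, o, u, each vowel appearing exactly once and in this order.
VOWELS_IN_ORDER = re.compile(
    r"^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$"
)


def vowels_once_in_order(words):
    """Return the words that contain all five vowels exactly once, in order."""
    return [w for w in words if VOWELS_IN_ORDER.match(w.lower())]


# Hypothetical sample standing in for a dictionary file.
sample = ["abstemious", "facetious", "education", "sequoia", "majestious", "regular"]
matches = vowels_once_in_order(sample)
print(matches)
```

Words such as "education" and "sequoia" contain all five vowels but not in alphabetical order, so they are filtered out.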

The second problem is to produce a list of all English words of at least six letters with letters appearing in increasing alphabetical order.

The book creates a regular expression

^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$

then uses it to filter a dictionary file as before, except there is an additional filter stage.

This produced 17 words including common words such as almost and ghosty. Some of the more interesting results were bijoux, chintz, and egilops. Kernighan and Pike explain that egilops is a disease that attacks wheat.
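The second expression can be tried in the same way. In this sketch the additional filter stage is assumed to be the length check (at least six letters), which the expression alone cannot express since it also matches short words and even the empty string. The word list and the helper name `increasing_words` are hypothetical.

```python
import re

# The expression quoted above: every letter is optional, appears at most
# once, and can only appear in alphabetical order.
INCREASING = re.compile(
    r"^a?b?c?d?e?f?g?h?i?j?k?l?m?n?o?p?q?r?s?t?u?v?w?x?y?z?$"
)


def increasing_words(words, min_length=6):
    """Return words of at least min_length letters whose letters strictly increase."""
    return [
        w for w in words
        if len(w) >= min_length and INCREASING.match(w.lower())
    ]


# Hypothetical sample standing in for a dictionary file.
sample = ["almost", "chintz", "bijoux", "ghosty", "egilops",
          "abbey", "below", "accent", "biopsy"]
matches = increasing_words(sample)
print(matches)
```

Here "below" has increasing letters but is too short, while "accent" is long enough but repeats the letter c, which the expression does not allow.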

For an explanation of those expressions please refer to the original blog post. And have fun while you are creating similar expressions for your next example filter ;-)

 

 

RapidMiner, Preview 21 Oct 2010
New GUI for Generate Attributes: Calculator Style by Ingo Mierswa Comment (2)

I just wanted to show you the new graphical user interface for the expression creator of the quite important operator "Generate Attributes". Thomas Ott of Neural Market Trends recently made a video about this great operator, which in general can be used to define new attributes (columns, dimensions...) based on a calculation on already existing ones. For example, you could create a new attribute named "area" by calculating the product of the two attributes "width" and "length" with the formula "width * length". Pretty easy, huh?

Recently, we introduced a whole set of new functions which can be used to work on text data. Now it is even possible to extract substrings from nominal values like in "cut(att1, 2, 5)", which creates a new attribute with the substring of length 5 starting at position 2 of the values of attribute "att1". Together with the numerous numerical functions and special functions like if-then-else conditions and others, the total number of supported functions hence grew a lot. And for exactly that reason we decided to develop a new user interface for the expression generation which now follows a nice calculator style:
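To make the two example expressions concrete, here is a rough Python emulation of what they compute for each row of a data set. This is not RapidMiner code: the `cut` helper only mimics the behavior described above, the assumption that positions are counted from zero is mine (the post does not say), and the sample rows and the attribute name `att1_part` are hypothetical.

```python
def cut(value, start, length):
    """Mimics cut(att, start, length) as described in the post.

    Assumption: 'start' is a zero-based position; the post does not
    state how positions are counted.
    """
    return value[start:start + length]


def generate_attributes(rows):
    """Add 'area' (width * length) and a substring of 'att1' to every row."""
    result = []
    for row in rows:
        new_row = dict(row)  # keep the existing attributes untouched
        new_row["area"] = row["width"] * row["length"]
        new_row["att1_part"] = cut(row["att1"], 2, 5)
        result.append(new_row)
    return result


# Hypothetical example rows.
rows = [
    {"width": 2.0, "length": 3.5, "att1": "RapidMiner"},
    {"width": 4.0, "length": 1.5, "att1": "VectorWise"},
]
result = generate_attributes(rows)
for row in result:
    print(row["area"], row["att1_part"])
```

As in the operator, each new attribute is derived row by row from attributes that already exist.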

 

 

As you can easily see, the user interface is inspired by a calculator. At the top, we have the actual expression which is created with the help of the other elements. Of course, you can type in any part of the expression into the field yourself at any time.

In the lower left part, you will find all available functions. You can change the currently displayed set by selecting a different function type in the combo box. In the lower right part you will find a list of all known attribute names. This list is only filled if the meta data is available, but you can of course simply type in the name yourself if it is unknown.

The usage of the new user interface is pretty simple: just click on a function in order to add it at the current caret position in the expression field. The caret is then placed in the parentheses so you can directly edit the function arguments. If you add an attribute by double clicking on one of them in the attribute list on the right, it is also added at the current position and the new caret position will be directly after the added attribute. You can also select some text in the expression field before you add a new function: in this case the selected part will automatically become the argument of the new function.

By the way, this is how to start the new user interface: simply click on the small calculator icon on the right of the expression field in the expression definition list of the Generate Attributes operator:

 

 

Another cool thing: each change - either manual or by using the elements - will trigger a validation check. If the check was successful, this is indicated by the small green check on the right. If not, a red cross appears and a tool tip explains why the expression cannot be validated. This again is a great help for analysts who do not want to wait until a long-running process crashes because there was an error in the function expression.

This new calculator and the new text functions will be delivered with the next release of RapidMiner coming soon!

OSBI, Open Source, Business Intelligence 20 Oct 2010
Open Source Business Intelligence Day 2010 by Ingo Mierswa Comment (0)

Rapid-I is happy to organize the second Open Source Business Intelligence Day (OSBI 2010) on December 8th, 2010 at the beautiful castle Hambacher Schloss in Neustadt an der Weinstraße, Germany.

The OSBI 2010 is a place where analysts and decision makers can meet and exchange their experiences with open source solutions for data warehouses, ETL, OLAP, Reporting, and Data Mining. The event is supported by our partners Actuate (BIRT), AncudIT, Ingres, Jaspersoft, Jedox (PALO), Talend, and viadee, among them all leading providers for open source business intelligence solutions.

Since the number of participants is restricted, we recommend registering early for this event. The program as well as an online registration form are now available on the OSBI 2010 web site (German only).

 

 OSBI 2010

 

We would be happy to meet you at the OSBI 2010. Take the opportunity and get information about the Open Source BI solutions

  • Jaspersoft,
  • Pentaho,
  • PALO,
  • BIRT,
  • Talend, and
  • RapidMiner / RapidAnalytics

and benefit from the experiences of other companies which already successfully deployed open source BI solutions. As you can see, the focus of the OSBI 2010 is on:

  • database systems, data warehouses and data quality,
  • data integration and ETL,
  • reporting and OLAP,
  • data mining and predictive analytics.

The OSBI 2010 gives you the opportunity to gather information in lectures, experience reports, workshops, and success stories about relevant and crucial aspects of business intelligence. You can meet the vendors at their information desks in order to get more information about their products. Additionally, coffee breaks and the flying buffet give you more chances for inspiring discussions and personal exchange of experiences.

I am looking forward to meeting many of you again at OSBI 2010! The program as well as an online registration form can be found at

OSBI 2010 web site
