Open source software for big data analytics.
No programming required.

Rapid-I Blog
Contest 17 Feb 2011
New Data Mining Contest by Ingo Mierswa Comment (0)

There is a new data mining contest on Research Garden in the field of nonlinear time series analysis. Come on, RapidMiners, get the 2500 Euro jackpot!

 

 


 

 

The goal of the contest is to develop a data analysis process that optimizes the performance of machines based on their sensor data. The task comes from the field of multivariate, non-linear time series analysis, and the total prize money is 2500 Euro. Nice.

More information can be found on the web site below (sorry, German only):

http://www.research-garden.de/web/guest/wettbewerbsdetails?cid=38180

Fun 16 Feb 2011
Go, Watson, Go: Win at Jeopardy with Basic Statistics by Ingo Mierswa Comment (2)
Well done, IBM. The new super computer named Watson was created and trained during the last 4 years by 25 IBM engineers in order to play (and win!) at Jeopardy. I have just watched a short video about the event and the result really looks impressive:




Watson played quite well against two of the best Jeopardy players in the world. I especially liked seeing the confidences at the bottom of the screen, since this allowed me to check the quality of their model. And they did a good job: in the cases where Watson was clearly confident, he was right most of the time.

Another nice thing was the reaction of the other contestants: several times they seemed to know the answer (the question) as well, but they were simply too slow.

And this was only day 1; on the second day of this three-day contest Watson performed even better. But after having dug a bit deeper, I found out that the techniques used were pretty simple. At first, I thought that Watson understood the questions by hearing them, but he gets them directly as text. This is of course a big advantage, since you don't lose any time "understanding" what has been said or written. Talking about time, another big advantage of Watson is that he does not lose any time pressing the buzzer.

The basic techniques are pretty simple as well: Watson stores about 200 million pages in a large search index - among them the complete Wikipedia - and searches for the given answer in those pages (ok, we probably all know how this works). From the top k results Watson extracts the most important person / concept / object etc. and creates an appropriate question. Few details have leaked about that, but from that little I got the impression that it's merely a topic detection or a named entity recognition, and that the confidence is based more or less on the average of the topic / NER confidences. Mix those simple ideas with the power of 2800 traditional computers and you get an impressive result...
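The idea described above (search an index, take the top k hits, pick the dominant entity, average the confidences) can be sketched as a toy example. This is only an illustration of that impression and certainly not IBM's implementation: the pages, the word-overlap score and all function names are made up for this sketch.

```python
from collections import defaultdict

# Toy "search index": (entity, page text) pairs. Made-up data for illustration.
PAGES = [
    ("Isaac Newton", "english physicist who formulated the laws of motion and gravity"),
    ("Albert Einstein", "physicist who developed the theory of relativity"),
    ("Gottfried Leibniz", "german mathematician who developed calculus independently"),
    ("Isaac Newton", "newton developed calculus and studied gravity and optics"),
]

def score(clue, text):
    """Fraction of the clue words that occur in the page text (crude relevance)."""
    clue_words = set(clue.lower().split())
    text_words = set(text.lower().split())
    return len(clue_words & text_words) / len(clue_words)

def answer(clue, k=3):
    """Return (entity, confidence) based on the top-k pages for the clue."""
    ranked = sorted(((score(clue, text), entity) for entity, text in PAGES),
                    reverse=True)
    per_entity = defaultdict(list)
    for page_score, entity in ranked[:k]:
        per_entity[entity].append(page_score)
    # Pick the entity with the highest total score among the top-k pages ...
    best = max(per_entity, key=lambda e: sum(per_entity[e]))
    # ... and report the average of its page scores as the confidence.
    confidence = sum(per_entity[best]) / len(per_entity[best])
    return best, confidence

def as_question(entity):
    """Jeopardy responses are phrased as questions."""
    return f"Who is {entity}?"

entity, confidence = answer("physicist gravity laws of motion")
print(as_question(entity), round(confidence, 2))
```

A real system would replace the word-overlap score with a proper search engine and the entity lookup with a trained named entity recognizer; the overall shape of the pipeline stays the same.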

The simple ideas most often are the most robust ones and the scientific and engineering efforts are impressive. Thanks, IBM, for those efforts and also for the positive effect this show probably has on the public acceptance of data mining and business analytics.
Plotter 9 Feb 2011
The RapidMiner Plotters 12: SOM by Ingo Mierswa Comment (0)

Overview "SOM"

  • Summary: Qualitative visualization of high-dimensional data sets on a 2-dimensional "geographical" map
  • Number of Dimensions: unlimited in theory, but results tend to get worse for larger numbers
  • Data Types: Numerical plus one numerical / nominal for point color

This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far can be found at the end of this article including the links to them. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.

Before we start our discussion about the SOM plotter, we will first have a look at the final result:

 

A SOM (Self-Organizing Map) is a visual representation of your data set on a two-dimensional area which resembles a geographical map. The basic idea is that data points which are close together in the original high-dimensional space should also be close together in the resulting two-dimensional space. In order to visualize those distances in the resulting space, a color mapping is used.

Have a look at the map above. Mountains indicate that the distances between points are high. Deep sea means that those points are closer together. For example, the green points on the left are separated less from the red points in the lower left corner (upper arrow) than the green and blue points (lower arrow).

Another property of the map is that the top border and the bottom border are connected, i.e. it behaves like a world map. You can continue the map seamlessly from top to bottom. The same is true for the left and the right border.
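The wrap-around property can be made concrete with a small distance function on the map grid. This is a minimal sketch under the assumption that the map is a rectangular grid of cells; the function name and the coordinate convention are chosen for this example and are not taken from RapidMiner.

```python
def torus_distance(a, b, width, height):
    """Distance between grid cells a and b on a map whose borders wrap around.

    a and b are (column, row) pairs. Along each axis the shorter of the
    direct way and the way across the border is used.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    dx = min(dx, width - dx)    # left and right border are connected
    dy = min(dy, height - dy)   # top and bottom border are connected
    return (dx * dx + dy * dy) ** 0.5

# On a 10 x 10 map, the first and the last column are direct neighbours.
print(torus_distance((0, 0), (9, 0), 10, 10))
```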

Ok, now that we have seen the result we are after, we will have a look at how to create the plot and configure the plotter.

 

There is a major difference between the SOM plotter and other plotters in RapidMiner. Internally, a SOM is an unsupervised neural network. The data points are assigned to the nodes of the network. The consequence of this is that the network has to be trained anew for each data set. And as you might know if you are familiar with neural networks: the training can take some time. Therefore, most changes of the SOM settings will not have any effect until you press the Calculate button at the bottom of the plotter options on the left.

 

 

After you have pressed the button, the network is calculated, which might take some time. The progress indicator above the Calculate button gives you a hint of how long you will have to wait. After a couple of seconds (or minutes - depending on the data set), you will get the visualization of the two-dimensional map like in the following picture:

 

Please note that you will have to select a Point Color in order to show the data points on the map. This most often will be the class of the data points or any other property you might be interested in. If you select an attribute of your data or model here, the values of this attribute will be used for determining the color of each of the data points. It does not matter if the selected column is numerical or nominal, both scenarios will work.

The next two options, Matrix and Style, are specific to the SOM visualization. With the Matrix option, you can choose if you want to display the distances (U-Matrix), the density of the data space (P-Matrix) or a combination of both (U*-Matrix). Please compare the U-Matrix (the picture above) with the U*-Matrix (the first picture in this post). The Style option determines the color scheme which is used for displaying this information. The default Landscape produces a geographical map like the one above; just play around in order to find the color scheme which is most appropriate for your data set.
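To make the U-Matrix a bit more tangible, here is a minimal sketch of how such distance values can be computed from the weight vectors of a trained net. It assumes that each node is compared with its four direct neighbours and that the neighbourhood wraps around the borders, in line with the world map property described above; the function name and these details are assumptions of this example and not necessarily what RapidMiner does internally. The P-Matrix and the U*-Matrix are not covered here.

```python
import math

def u_matrix(weights):
    """Mean distance of each node's weight vector to its four grid neighbours.

    weights[row][col] is the weight vector (a list of floats) of one node.
    High values correspond to "mountains", low values to "deep sea".
    """
    height, width = len(weights), len(weights[0])
    result = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # The modulo makes the neighbourhood wrap around the map borders.
            neighbours = [((r - 1) % height, c), ((r + 1) % height, c),
                          (r, (c - 1) % width), (r, (c + 1) % width)]
            dists = [math.dist(weights[r][c], weights[nr][nc])
                     for nr, nc in neighbours]
            result[r][c] = sum(dists) / len(dists)
    return result

# A 3 x 3 net with one-dimensional weights; only the centre node differs.
net = [[[0.0], [0.0], [0.0]],
       [[0.0], [4.0], [0.0]],
       [[0.0], [0.0], [0.0]]]
for row in u_matrix(net):
    print(row)
```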

As we have stated above, the SOM is internally represented by a network consisting of a fixed number of nodes. The size of this network can be determined with the settings Net Width and Net Height. There are also two important training parameters for the underlying neural network, namely Training Rounds and Adaptation Radius. The default values are fine for most settings but you might want to optimize those for certain data sets. After having changed those settings, you will have to recalculate the plot by pressing Calculate.
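The role of these four settings is easier to see in code. The following is a minimal sketch of a textbook SOM training loop; the parameter names mirror the plotter options, but the learning rate, the linear decay and the Gaussian neighbourhood are assumptions of this example and not necessarily what RapidMiner uses. For brevity, the grid does not wrap around at the borders here.

```python
import math
import random

def train_som(data, net_width=4, net_height=4, training_rounds=30,
              adaptation_radius=2.0, learning_rate=0.5, seed=0):
    """Train a small SOM and return {(x, y): weight vector} for all nodes."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One randomly initialised weight vector per node of the net.
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(net_width) for y in range(net_height)}
    for round_no in range(training_rounds):
        # Radius and learning rate shrink linearly over the training rounds.
        decay = 1.0 - round_no / training_rounds
        radius = max(adaptation_radius * decay, 0.5)
        rate = learning_rate * decay
        for point in data:
            # The best matching node is the one whose weights are closest.
            bmu = min(weights, key=lambda n: math.dist(weights[n], point))
            for node, w in weights.items():
                d = math.dist(node, bmu)  # distance on the grid, not in data space
                if d <= radius:
                    influence = math.exp(-(d * d) / (2 * radius * radius))
                    for i in range(dim):
                        w[i] += rate * influence * (point[i] - w[i])
    return weights

def best_matching_node(weights, point):
    """Grid position of the node on which a data point is placed."""
    return min(weights, key=lambda n: math.dist(weights[n], point))

data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, net_width=4, net_height=4, seed=1)
for point in data:
    print(point, "->", best_matching_node(som, point))
```

Since the initial weights are random, a different seed can lead to a different map, which is the same effect as pressing Calculate another time.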

Since the data points are initially all located on the network nodes, it often happens that multiple data points are located on a single node and overlap. For this reason, the Jitter option is very useful for SOMs. Just move the jitter slider around and watch what happens: the points move a bit in a random direction, revealing whether and which points lie underneath.

The last option is pretty simple and does the same as for the other plotters: Export Image opens a dialog which allows you to export the current plotter with all its settings into one of the dozens of supported image formats.
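What the jitter slider does can be sketched in a few lines: every point gets a small random offset, so that points sharing a node no longer cover each other. This is a minimal sketch assuming a uniform random offset; the function name and the `amount` parameter are made up for this example.

```python
import random

def jitter(points, amount, seed=None):
    """Move each (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]

# Three data points sitting on the same network node.
points = [(1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
print(jitter(points, 0.2, seed=1))
```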

A last note on SOM visualizations: the calculation of the neural network depends on random numbers and pressing Calculate another time might deliver a different - and sometimes more appropriate - result. Just try to recalculate a visualization by pressing Calculate again.

Other parts of the plotter series:

 

reporting, RapidAnalytics Video Tutorial 3 Feb 2011
Using RapidAnalytics 4: Creating Simple Reports by Simon Fischer Comment (0)

In last week's post I described how to create Web-based dynamic charts out of RapidMiner processes. We now compose complex reports out of these charts. You will also learn how to define domains that make selecting parameters more comfortable for the report viewer, and help to ensure that only legal parameters are entered.

Next week we will see how interactions like drill-downs etc. can be defined.

Trading, RapidMiner, Processes, Clustering 28 Jan 2011
Making people happy with Clustering by Ingo Mierswa Comment (3)

As a company, Rapid-I of course is interested in getting paid for our software and services. This should actually be reward enough, but we often get much more, for which we are also grateful: the look on people's faces when they see for the first time how well data mining works and what they can probably get out of it. This is amazing.

Being not only a company but even an open source company allows us to share this great feeling even more often. In his blog entry, Joshua Frankamp shares his experiences with us. Quote:

"This puts a huge grin on my face."

Why's this? Well, have a look at the following picture:

 

Regular RapidMiner users can immediately tell that this is probably the plot of a hierarchical cluster model.

"It tells me that BAC, KEY, MI, RF, SNV, and STI are related. They’re all acronyms, yes, but more than that. They are all banking stock symbols. The node that the arrow points to contains these values. All banks. The nodes on either side contain exclusively real estate holding companies and home builders respectively."

Well, this might of course not be surprising from a data analyst's point of view, but as far as I can see, Joshua has just started his data mining tour. And I get the feeling that he is now hooked and will probably never stop :-)
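For readers who have not seen hierarchical clustering before, the basic idea fits into a few lines: start with one cluster per item and repeatedly merge the two closest clusters. The following is only a sketch of that idea and not Joshua's actual process; the series are synthetic toy numbers, the names are placeholders rather than real stock symbols, and using 1 minus the correlation as distance with average linkage is an assumption of this example.

```python
import math

def correlation(a, b):
    """Pearson correlation of two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)

def cluster(series, n_clusters):
    """Merge the two closest clusters until only n_clusters are left."""
    clusters = [[name] for name in series]

    def distance(c1, c2):
        # Average linkage: mean of (1 - correlation) over all cross pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - correlation(series[a], series[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        index_pairs = [(i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))]
        i, j = min(index_pairs,
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Synthetic series: the two "banks" move together, the two "builders" move
# together, and both groups move in opposite directions.
series = {
    "bank_a": [1, 2, 3, 2, 1],
    "bank_b": [2, 4, 6, 4, 2],
    "builder_a": [3, 2, 1, 2, 3],
    "builder_b": [6, 4, 2, 4, 6],
}
print(cluster(series, 2))
```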

Ok folks, have fun discovering those and other - maybe more hidden - insights with RapidMiner. And don't forget the look you had on your face the first time you saw some correlation or causal connection - data analysis can be and should be surprising every day anew. Last quote from Joshua:

"Its great to be able to test out some of the things that I’ve been learning about in an environment that lets me try a lot of things in a relatively short amount of time. "

Two thumbs up, Joshua, and keep going!

 

Web service, RapidAnalytics Video Tutorial, integration 27 Jan 2011
Using RapidAnalytics 3: Exposing RapidAnalytics Processes as Web Services by Simon Fischer Comment (0)

In today's video tutorial I discuss one of my favourite features of RapidAnalytics. I'll explain how you can turn a RapidMiner process into a Web service. RapidMiner macros will be the parameters of the Web service, and output can be formatted in various ways. It can be either a presentation-oriented service (Flash charts, images, tables, etc.) or it can generate machine readable output in various formats. Thus, you can easily integrate RapidAnalytics with other IT infrastructure.

In a later post I will show how this can be used to create complex interactive reports.

OSBI, Event 21 Jan 2011
Pictures of OSBI 2010 now online by Ingo Mierswa Comment (0)

I know it's a bit late but anyway: the pictures of the OSBI 2010 are online now. You can find them on our Facebook page. There was also a review of the OSBI 2010 which you can find here.

 

tutorial, Scheduling, RapidAnalytics Video Tutorial, RapidAnalytics 20 Jan 2011
Using RapidAnalytics 2: Advanced Scheduling by Simon Fischer Comment (0)

In the first video post about the basics of RapidAnalytics I showed how RapidMiner processes can be executed on RapidAnalytics. This time, I go into more detail and show some advanced scheduling features.

In this video you will learn how to

  • schedule processes for delayed execution,
  • schedule processes for regular execution,
  • use RapidMiner or the Web interface to do this,
  • monitor process execution on the Web and in RapidMiner, and
  • parametrize the executed process when submitting the schedule.

Open Source, Event 19 Jan 2011
Meet Rapid-I at EOSD 2011 by Ingo Mierswa Comment (0)

 


 

There is another opportunity to meet some Rapid-I guys, namely Simon and myself (Ingo) at a really cool event here in good old Germany. The upcoming Enterprise Open Source Day (EOSD 2011) is taking place in Nürnberg, Germany and offers a lot of interesting talks about many open source solutions. Rapid-I is not only giving a talk there but also a workshop about the brand-new RapidAnalytics.

The Enterprise Open Source Day is the major German conference and information platform for open source business solutions. Rapid-I and other vendors present their new products there and new trends are discussed in the talks and presentations. The additional technical workshops are the ideal way to deepen your knowledge about the presented solutions. Participants can also get more information at the info booths of the software vendors.

More information and registration at

http://www.eosd2011.de/

 

Looking forward to meeting you at the EOSD 2011. Cheers,
Ingo

tutorial, RapidAnalytics Video Tutorial, RapidAnalytics 12 Jan 2011
Using RapidAnalytics 1: Storing Data and Executing Remote Processes by Simon Fischer Comment (0)

Today we'll start a new video series demonstrating how to use RapidAnalytics for your analytic work. RapidAnalytics is the new data mining server solution that uses RapidMiner both as a data mining engine and as a front end to design data mining processes.

In this first video you will learn how to

  • use the Web interface,
  • connect RapidMiner to RapidAnalytics,
  • store data on RapidAnalytics,
  • execute processes on RapidAnalytics, and
  • open the results in RapidMiner.

To reproduce what is shown in the video, you need to install RapidMiner 5.1 and RapidAnalytics 1.0, both of which are available from our Web site.

This is only the beginning of the new series. In subsequent posts you will learn how to schedule processes regularly, expose processes as Web services to embed them into other applications, how to generate fancy Web-based dynamic reports based on RapidMiner processes, and more.
