Rapid-I Blog
Tag >> Data Mining
research, RapidMiner, Data Mining 9 Jan 2012
The Intelligent Discovery Assistant by Simon Fischer

Imagine that all you had to do to create a data mining process was to select a data set and specify what you want to do with the data, e.g. predictive modelling. Wouldn't that save a lot of work?

Within the research project "e-LICO", funded by the EU within the 7th Framework Programme, the Intelligent Discovery Assistant (IDA) was developed, and it does precisely that. It comes with its own perspective (marked with the silhouette of a friendly butler) that contains all you need: the repository and the assistant itself. To use it, follow three simple steps:

  1. Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
  2. Select a goal. The most frequent one is probably "Predictive Modelling". All goals have comments, so you see what they can be used for.
  3. Select "Fetch plans" and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select several processes (selecting none means selecting all) and evaluate them on your data in a batch.

The assistant strives to generate processes that are compatible with your data. To do so, it adds preprocessing steps where they are needed. For example, it automatically replaces missing values if the data contains any and the learning algorithm requires it, and it normalizes the data when a distance-based learner is used.
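To make the idea concrete, here is a minimal Python sketch of that kind of rule-based planning. It is not the IDA's implementation (the real planner runs in Prolog); the classes `DataSetInfo` and `LearnerInfo`, the function `plan_process`, and the step names are all hypothetical and only illustrate the two rules mentioned above.

```python
from dataclasses import dataclass


@dataclass
class DataSetInfo:
    """Hypothetical summary of a data set (not a RapidMiner class)."""
    has_missing_values: bool


@dataclass
class LearnerInfo:
    """Hypothetical description of what a learner needs from its input."""
    name: str
    tolerates_missing_values: bool
    distance_based: bool


def plan_process(data: DataSetInfo, learner: LearnerInfo) -> list[str]:
    """Return an ordered list of step names that ends with the learner."""
    steps = []
    # Replace missing values only if they exist AND the learner cannot handle them.
    if data.has_missing_values and not learner.tolerates_missing_values:
        steps.append("Replace Missing Values")
    # Distance-based learners are sensitive to attribute scales, so normalize first.
    if learner.distance_based:
        steps.append("Normalize")
    steps.append(learner.name)
    return steps


# Example: a distance-based learner on data with missing values.
knn = LearnerInfo("k-NN", tolerates_missing_values=False, distance_based=True)
print(plan_process(DataSetInfo(has_missing_values=True), knn))
# prints ['Replace Missing Values', 'Normalize', 'k-NN']
```

A learner that tolerates missing values and does not rely on distances would get a plan consisting of the learner alone, which is why the same data set can lead to several different candidate processes.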

You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL. Alternatively, just download it directly and place it in RapidMiner's lib\plugins folder.

Since the workflow planning happens in Prolog, this extension automatically installs a Prolog engine (XSB Prolog plus Flora 2) when it first starts. These can only be installed into a specific directory, so you must run RapidMiner as administrator when using the extension for the first time. (On Windows, right-click and select "Run as administrator".)

If you try out the extension, we ask you to participate in the user survey so we can keep improving the extension. You can easily open the survey by installing the extension and clicking on the third button in the toolbar (the one with the letter box).

The IDA was developed as a collaboration mainly between the University of Zurich (Jörg-Uwe Kietz and Floarea Serban) and Rapid-I.

RCOMM, Events, Data Mining, Contest 2 Nov 2011
Who Wants to be a Data Miner? (RCOMM 2011) by Simon Fischer

One of the most fun events at the annual RapidMiner Community Meeting and Conference (RCOMM) is the live data mining process design competition "Who Wants to be a Data Miner?" In this competition, participants must design RapidMiner processes for a given goal within a few minutes. The tasks are related to data mining and data analysis, but are rather uncommon. In fact, most of the challenges ask for things RapidMiner was never supposed to do.

In 2010, we posted the winning processes immediately after the conference. This year we did not do so because the processes depend on input files which could not easily be attached to these processes on myExperiment. As of RapidMiner 5.1.11 we have a new way of handling files that makes it easier to link RapidMiner processes against data files on the Web (more on this soon in this blog). Therefore, all data files have now been uploaded to Rapid-I webspace, and the processes are also on myExperiment, bundled in a pack.

The 2011 challenges were quite fun and dealt with Hobbits, Vodka, and our latest, brand-new product: RapidDraw. The processes are quite instructive and are worth playing around with. With the RapidMiner Community Extension you can download the processes directly from myExperiment into RapidMiner (just search for RCOMM). Alternatively, view the pack description on myExperiment.

Fun, Data Mining 23 Mar 2011
Predictive Analytics and Cricket by Ingo Mierswa

I am not really deep into cricket myself. However, I found this interesting blog entry which discusses some reasons for successful cricket games discovered by data mining. It is not hard to tell that the author favors the Indian team :-)

The first thing to do is some basic statistics: How often did the Indian cricket team win in the past against certain other teams? For example, the Indian team won against England in 66% of all matches the two teams played against each other during the last 5 years. Against Australia, however, India won only 40% of those matches.

So the important point is: what were the circumstances under which India won those 40%? And here is where RapidMiner was used: the matches were described by attributes like "partnership", "pace bowlers", or "slow bowlers". The resulting decision tree looks like the following:

Decision Tree for Cricket

The model was built on all existing cases between India and Australia from the last 5 years. It is easy to tell that partnerships play the most significant role. In particular:

  • India need to have 2 significant partnerships worth at least 77 runs
  • If not, the bowlers, specifically pace bowlers, have to step into the breach and take more than 7 wickets

Without any knowledge about cricket, I have hardly any idea what this actually means. I suppose that those two strong partnerships with 77 runs or more are two sets of good batting partners playing well with each other. If you don't have those, it seems that fast bowlers taking down the wooden "goals" more than 7 times helps a lot.
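Read literally, the two rules from the tree can be written down as a tiny function. This is only a sketch of the two bullet points above, not the full decision tree from the original report; the function name `india_predicted_to_win` and its parameters are made up for illustration, and counting a partnership as "significant" when it is worth at least 77 runs is my reading of the first rule.

```python
def india_predicted_to_win(partnership_runs, pace_bowler_wickets):
    """Apply the two rules read off the tree (an interpretation, not the real model).

    partnership_runs: runs scored by each batting partnership in the match.
    pace_bowler_wickets: number of wickets taken by the pace bowlers.
    """
    # Rule 1: at least two partnerships worth at least 77 runs each.
    significant = [runs for runs in partnership_runs if runs >= 77]
    if len(significant) >= 2:
        return True
    # Rule 2: otherwise the pace bowlers must take more than 7 wickets.
    return pace_bowler_wickets > 7


print(india_predicted_to_win([80, 95, 12], pace_bowler_wickets=3))  # prints True
print(india_predicted_to_win([80, 20], pace_bowler_wickets=7))      # prints False
```

Writing the rules out like this also makes it obvious how little the model actually looks at, which is one more reason to validate it on matches it has not seen.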

This is what data mining is actually about: finding insights in data without the need for prior knowledge (of course you have to validate the findings!). That validation is actually missing in the blog post, but maybe it is part of the full report, which can be downloaded from the web site. Still, it is a fun read and a nice data mining application!

Data Mining 7 Dec 2010
Data Mining Map by Ingo Mierswa

Recently we had a discussion about good resources for data mining beginners in our forum. Well, there are a lot of books out there and I am not going to repeat the recommendations from the forum thread here.

However, I would like to add another resource which is quite helpful to understand many important concepts of data mining and how they relate to each other. Check out the data mining map of Dr. Sayad of the University of Toronto:

http://chem-eng.utoronto.ca/~datamining/dmc/data_mining_map.htm

Of course, the texts behind the map are not complete in the sense that you would not need any other resource or that every topic is covered. But it is fun to browse through the concepts and delve deeper and deeper until you finally reach the more sophisticated algorithms.

 

Data Mining Map

Check out this nice training resource and have fun!

 

Open Source, Data Mining 17 Nov 2010
Be cautious about open source data mining by Ingo Mierswa

Yesterday I stumbled upon an article called "Be cautious about open source data mining" written by Anh Nguyen about a talk given by Jos van Dongen at the Predictive Analytics World in London. My initial thought was something like "ok, the author is probably just a partner of some proprietary software vendor living well off the sales commissions for the sold licenses".

Hence, I did not expect anything neutral and objective but a completely proprietary-vendor-X-oriented article describing with greatest eloquence why proprietary solution X is so much better than any open source solution. Things like: Those open source solutions are free. They simply cannot work, for exactly this reason. And they are of course a danger not only for the complete IT infrastructure but also for the analyst's mind and of course for the whole enterprise. Which is, by the way, very likely to break down simply by introducing something they did not pay millions in license fees for. I have actually read enough articles like that before and initially I did not want to give this one another chance.

Since I had to wait for another couple of minutes before a meeting started, I clicked on the link and was deeply surprised. There was a set of theses which were completely reasonable. I liked those and hence I want to comment on them and extend them a bit:

 

"It's free but should be evaluated like any other software"

This is actually nothing new and I fully agree. Of course I like what we are doing here at Rapid-I and personally I think RapidMiner / RapidAnalytics are among the best solutions for almost every aspect of data analysis you can think of. Nevertheless, there are situations where other solutions might be more appropriate. At least there is a chance for this, so you should give all options a try. What did you just say? This is not easy since not all options are delivered as open source solutions? Right. But that's hardly our fault...

 

"It doesn’t matter if the software is free if it takes longer to build, manage and deploy solutions to end users, or if it is unstable, or missing key features. Don’t select just because it is open source"

Again I fully agree. Choosing a solution simply because it is open source is probably as stupid as avoiding it for exactly that reason. Regarding the potential drawbacks connected to maintenance or software quality, I would like to add that this is exactly why successful commercial open source companies like Rapid-I offer their Enterprise Editions. Those editions help to overcome those software issues by providing stabilized releases, higher levels of quality assurance, and full support. If you want a fair comparison, you should go for the now-no-longer-free Enterprise Editions and compare those against proprietary solutions. By the way: in my experience, maintaining software or worrying about missing features feels exactly the same for open and closed source products. The difference lies not in the software per se but in the service quality of the companies.

 

"van Dongen believes that if a business does not have any existing tools for data mining, they should make open source the default option."

This is the strongest claim and I want to support it. The essence here is: if you already have a software solution for data mining, I think the best way is not to rip it out of your infrastructure and immediately and completely replace it with an open source solution. Move gradually and employ RapidMiner for the next project before stocking up your licenses for the other solution. Or make it the default if you don't have a solution at all and have to get used to a new solution for data mining or business analytics anyway. We experienced all three ways during the last years: moving gradually from a closed-source solution to RapidMiner from project to project, starting with RapidMiner as the primary data mining solution right away, and replacing the old solution with RapidMiner all at once. I must say that the last way was the hardest option for all people involved in those projects. But again, this is not specific to open source; it applies to replacing or migrating between different types of software in general.

 

Oh, and by the way: Another thing I really liked is that van Dongen and Anh Nguyen recommended RapidMiner as an open source solution for data mining. That made me like this article even more than I already did :-)

 

Here is a PDF file containing the article in case it has been removed from the web.
