The Rapid-I team keeps on mining, and this time we excavated two great books for our users. The first, Data Mining for the Masses by Matthew North, is a very practical book for beginners and intermediate data miners, whereas The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman provides deep insight into the mathematical models at the heart of every data analysis. It is not exactly hot off the press, but it has not lost its glamour since the release of the first edition a couple of years ago.
The book is targeted at readers with a statistical, mathematical or informatics background who want to understand not only how to use an operator in RapidMiner, but also why it works. Readers should not be afraid of mathematical formulas, but they will be rewarded with a solid understanding of many methods implemented in RapidMiner and of the connections and inter-relationships between different learning algorithms: What do decision trees and rule learning algorithms have in common? Should you try an SVM if k-NN fails on your data?
The Elements of Statistical Learning can be considered a standard book used in many data mining lectures around the world, which may be attributed to the fact that it does not just contain all the detailed information, but also presents it with relatively simple explanations - keeping in mind, of course, that understanding complex topics will always require a whole lot of effort. The book can be downloaded from the authors' website.
Data Mining for the Masses, on the other hand, takes a practical approach and, as the name implies, aims at a broader range of readers. Those of you who visited this year's RCOMM already had the opportunity to attend the author's own presentation of the book: Matt's comprehensive book gives a detailed and profound introduction to data mining. All major concepts of data mining are covered in a well-structured manner using real-life examples, most of which are solved completely with RapidMiner.
The book actually begins one step before the data analysis and explains the meaning of data mining itself, and it also does not leave out the ethical concerns a responsible data miner should keep in mind. Thanks to its easy style of writing and good examples, this book is suited not only for IT professionals and college students who want to take a deeper look into data mining, but for anyone who wants to learn how to get the most out of their data.
This year's RCOMM live data mining challenge, "Who wants to be a data miner?", was a tricky yet fun task for the competitors. The task was to (partially) solve a Sudoku puzzle with RapidMiner. The solution shows that you can achieve virtually any data analysis task you can think of using only standard RapidMiner processes.
The input data set consisted of examples of the form (x,y,v) where x and y indicated the column and row inside the puzzle and v was the number predefined at this cell. The path to the solution was split into three subtasks:
Task one was to generate the space of all combinations of cells and numbers that would be possible if there were no predefined numbers in the Sudoku. This task could be solved by starting with a simple data set containing only the numbers one to nine and applying two Cartesian Product operators to generate all combinations of these numbers for x, y, and v. In addition, we generate a new attribute z, which indicates the 3x3 sub-table in which the cell (x,y) lies, using a Generate Attributes operator.
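In plain Python, the combination space of Subtask 1 can be sketched as follows. This is our own illustration, not the original RapidMiner process; the helper name `subtable` and the exact formula for z are assumptions of this sketch:

```python
from itertools import product

# The two Cartesian Product operators yield all 9 * 9 * 9 = 729
# candidate triples (x, y, v) for an empty Sudoku.
candidates = list(product(range(1, 10), repeat=3))

# The Generate Attributes operator adds z, the index (1..9) of the
# 3x3 sub-table containing cell (x, y).
def subtable(x, y):
    return (x - 1) // 3 + 3 * ((y - 1) // 3) + 1

rows = [(x, y, v, subtable(x, y)) for (x, y, v) in candidates]
print(len(rows))  # 729
```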
This additional attribute z is useful for Subtask 2, which was to eliminate all combinations that are impossible given a single predefined cell value from the input data set. This was possible by combining a Generate Attributes operator with a Filter Examples operator to identify those combinations (from the set of all combinations generated in Subtask 1) where v and at least one of x, y, or z match a number defined in the input.
The resulting process of Subtask 2 could be re-used in Subtask 3 to eliminate all combinations whose impossibility could be inferred from looking at all predefined cell values in the input data set. This could be achieved by using a Loop Examples operator to iterate over all cells and a nested Execute Process operator to re-use the process from Subtask 2. Finally, by looking at those cells where only one possible number remains, we can identify new values that can be inserted into the Sudoku with certainty. These cells can be identified by using an Aggregate operator to group the remaining possibilities by x, y, and z and finding those groups with a count of exactly one.
As a bonus process, we can repeat the process from Subtask 3, iteratively appending the inferred numbers to the predefined ones with the Append operator. The data set thus grows to a size of 81, which means the 9x9 Sudoku is complete. Finally, we use the Pivot operator and do some polishing to make the result look like this, a completely solved Sudoku:
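The three subtasks and the bonus iteration can be sketched end-to-end in plain Python. This is our own minimal re-implementation of the logic described above, not the RapidMiner process itself; the function names are ours, and the naked-single inference it uses is only guaranteed to complete puzzles with enough clues:

```python
from collections import defaultdict
from itertools import product

def subtable(x, y):
    """3x3 sub-table index of cell (x, y): the z attribute, 1..9."""
    return (x - 1) // 3 + 3 * ((y - 1) // 3) + 1

def solve(clues):
    """Repeat Subtask 3, appending every cell with exactly one
    remaining candidate to the clues, until nothing new is inferred."""
    clues = set(clues)
    while True:
        # Subtask 1: the full candidate space of (x, y, v) triples.
        cands = set(product(range(1, 10), repeat=3))
        # Subtasks 2 + 3: eliminate candidates ruled out by each clue:
        # other values in the clue's cell, and the clue's value elsewhere
        # in the same row, column, or 3x3 sub-table.
        for (cx, cy, cv) in clues:
            cands = {(x, y, v) for (x, y, v) in cands
                     if not ((x, y) == (cx, cy) and v != cv)
                     and not (v == cv and (x, y) != (cx, cy)
                              and (x == cx or y == cy
                                   or subtable(x, y) == subtable(cx, cy)))}
        # Aggregate step: cells with a candidate count of exactly one.
        by_cell = defaultdict(set)
        for (x, y, v) in cands:
            by_cell[(x, y)].add(v)
        new = {(x, y, min(vs)) for (x, y), vs in by_cell.items()
               if len(vs) == 1} - clues
        if not new:
            return clues
        clues |= new
```

Fed a puzzle where this inference suffices, `solve` returns all 81 (x, y, v) triples, mirroring the bonus process that grows the data set to size 81.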
All the processes mentioned above are accessible from myExperiment as a pack which you can use from within RapidMiner via the Community Extension. The processes can be opened directly in RapidMiner, but when saving them, make sure to name them as they are named in myExperiment, since an Execute Process operator in a later process may expect them under those names. Process 0 in this pack downloads the initial data from the Web.
The third RapidMiner Community Meeting and Conference (RCOMM 2012) is quickly approaching and we are very excited about a great program full of talks, success stories, and demonstrations. RCOMM 2012 will be held at the Budapest University of Technology and Economics (BME), Budapest, Hungary from August 28 to 31, 2012.
The normal registration rate ends on August 13th, so we recommend registering now to take advantage of the discounts!
What to expect?
RCOMM 2012 offers more than 20 presentations, a social program, and our famous game show "Who wants to be a data miner?" The presentations include:
Mining Machine 2 Machine Data (Katharina Morik, TU Dortmund University)
Handling Big Data (Andras Benczur, MTA SZTAKI)
Introduction of RapidAnalytics at Telenor (Telenor and United Consult)
Presentations aim at researchers and practitioners using or extending RapidMiner for commercial or scientific use. Topics include analysis processes, use cases, success stories, best practice recommendations, or descriptions of software packages building upon or extending RapidMiner and RapidAnalytics.
Another important highlight of the conference will be the presentation of the new book "Data Mining for the Masses" by Matthew North of Washington & Jefferson College, which makes use of RapidMiner.
We are happy to present the third RapidMiner Community Meeting and Conference (RCOMM 2012). RCOMM 2012 will be held at the Budapest University of Technology and Economics in Budapest, Hungary from August 28 to 31, 2012.
The past RCOMMs have been a great success, with lots of participants, many great talks, and workshops surrounding the conference. RCOMM 2012 intends to intensify community life again and strengthen the RapidMiner network by bringing together users and developers of RapidMiner from all backgrounds. Presentations can be about applications (processes, use cases, best practice recommendations) or descriptions of software packages building upon or extending RapidMiner.
Presentations aim at researchers and practitioners using or extending RapidMiner for scientific or commercial use. Topics include analysis processes, use cases, success stories, best practice recommendations or descriptions of software packages building upon or extending RapidMiner. Learn more about how to submit a paper to RCOMM 2012 at
One of the most fun events at the annual RapidMiner Community Meeting and Conference (RCOMM) is the live data mining process design competition "Who Wants to be a Data Miner?" In this competition, participants must design RapidMiner processes for a given goal within a few minutes. The tasks are related to data mining and data analysis, but are rather uncommon. In fact, most of the challenges ask for things RapidMiner was never supposed to do.
The 2011 challenges were quite fun and dealt with Hobbits, Vodka, and our latest, brand new product: RapidDraw. The processes are quite instructive and are worth playing around with. With the RapidMiner Community Extension you can download the processes directly from myExperiment into RapidMiner (just search for RCOMM). Alternatively, view the pack description on myExperiment.
Those of you who visited RCOMM 2011 already know about Radoop, the powerful combination of RapidMiner with Hadoop. It makes big data analytics easier than ever. I missed the talk myself (shame on me!) but we had a lot of fruitful discussions afterwards, and from my point of view this will become the next RapidMiner revolution. Below you will find some information about the project.
What is Hadoop?
Hadoop is a software framework that supports data-intensive distributed applications. It is based on Google's now well-known map & reduce paradigm, which makes it an excellent tool for analyzing large data sets. In principle, Hadoop is able to work with thousands of computing nodes on petabytes of data.
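To illustrate the paradigm, here is the classic word-count example written as a plain local Python sketch. This only illustrates the map & reduce idea, it is not actual Hadoop code, and the function names are ours:

```python
from collections import defaultdict

# Word count in map & reduce style: the map phase emits (key, value)
# pairs, the shuffle groups them by key, and the reduce phase folds
# each group. Hadoop runs these phases distributed over many nodes;
# this sketch runs the same logic locally on a list of text lines.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because each map call only looks at one line and each reduce call only looks at one key's group, both phases can be spread across arbitrarily many machines, which is exactly what makes the paradigm scale.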
What about Hive and Mahout?
Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses the distributed file system of Hadoop and its efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouses.
Mahout is a machine learning library offering many scalable algorithms implemented on top of Hadoop and its map & reduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks making use of the power of Hadoop.
You will see below that both frameworks will be tightly integrated with RapidMiner.
What can RapidMiner bring into the game?
Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst performs some nasty tricks, the data size is limited by the available memory. So we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.
RapidMiner + Hadoop = Radoop
Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout with the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
Here is the presentation Zoltán Prekopcsák gave at RCOMM 2011:
The second day started with another invited talk, namely Matthias Reif of the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) talking about Towards Next-Generation Data Mining. Matthias presented very interesting insights into new trends in data analysis, including the data-driven recommendation of classifiers and the prediction of classifier accuracy and resource consumption. He depicted the integration of those techniques into server-based solutions like RapidAnalytics, which will be the next step towards collaborative data analysis in the cloud. Very fascinating! Matej Mertik of the Faculty of Information Studies in Novo mesto then presented an application of RapidMiner in the medical domain. I must admit that I did not fully get the connection between feature selection and the presented game-of-life approach, but I am sure that we will get a chance to sort these things out later on. The session was concluded by Andrew Chisholm of the ITB with a talk about possibilities of cluster evaluation. This was really a great talk – within 30 minutes Andrew perfectly explained his route through the pitfalls around unsupervised data analysis on a real-world problem. Andrew is an experienced speaker and told a great story with many nice ideas behind it – it was really a pleasure to listen to him.
The second session on this day covered new Extensions for RapidMiner and RapidAnalytics. The first talk, by Radim Burget of the Brno University of Technology, discussed their new Image Mining Extension, which is already available on our marketplace (see below). It looks great and I will certainly give it a try soon! Afterwards, Milos Jovanovic of the University of Belgrade presented a combination of their WhiBo toolkit, presented last year, with a genetic programming approach. The result is an optimized decision tree composed of the single steps and sub-algorithms known from different decision trees and their implementations. This is pretty close to some ideas of my master's and PhD theses, so I very much liked this idea (go guys and make it multi-objective next!) ;-)
Simon Fischer presented new Extensions and the Rapid-I Marketplace.
Simon Fischer of Rapid-I concluded this session with an overview of upcoming RapidMiner Extensions and new features which will be released over the next weeks and months, including the new operator recommender. Simon also presented the new marketplace (http://marketplace.rapid-i.com), which serves as a central store for RapidMiner Extensions and analytical algorithms. Simon then showed some of the business analytics features of RapidAnalytics, namely the pixel-precise report designer and the integration of analytical results into interactive web-based reports.
The last session on the second day covered text and web mining. Felix Jungermann of the TU Dortmund presented new techniques for handling tree structures in RapidMiner and showcased these extensions for information extraction and relation detection. Bruno Ohana of the ITB in Dublin then presented a hot topic right now: sentiment analysis and opinion mining – of course done with RapidMiner. This was an interesting talk, and a comparison to other approaches demonstrated the very high quality of the results. The last talk of the conference, by Clemens Forster of the Vienna University of Economics and Business, also covered sentiment analysis in customer feedback.
Live music in the Temple Bar.
We made a trip to the Temple Bar district afterwards and visited some of the most famous pubs in Dublin. I tasted strawberry beer (well, interesting…) and listened to good music. It was almost a miracle that more than 20 participants managed to attend the certification exam on Friday after this evening ;-)
Directly after the RCOMM, I was on vacation and therefore did not find the time to write something about RCOMM 2011 until now, sorry. Here is a review for those who visited the conference or who want to learn more about what happened in Dublin a couple of weeks ago.
RCOMM 2011 was again a huge success! It was great to meet many users of RapidMiner and RapidAnalytics again after the first community meeting in Dortmund last year. Many visitors from RCOMM 2010 also found their way to Dublin and started to build the core community around RapidMiner. So first of all: Thanks to all who attended and especially to those who contributed to the conference by giving a talk about their analysis work or new RapidMiner Extensions.
A couple of participants enjoying the 2011 version of our now-famous game show "Who wants to be a Data Miner?"
Monday was a public holiday in Ireland, and so we started on Tuesday with two parallel half-day training courses, one for beginners and one for more experienced analysts. A second set of parallel training sessions took place on Friday morning directly before the exam.
The actual conference started on Wednesday morning with an invited talk by Prof. Dr. Fionn Murtagh, who is a director of the Science Foundation Ireland and an experienced data analyst. He pointed out the usefulness of ultrametrics for clustering and exemplified this through a wide range of case studies. These included the Colombian social violence between 1990 and 2004 as well as some very interesting insights into optimal movie plots. Fionn is a great scientist and I enjoyed his talk very much. Matko Bošnjak and Nino Antulov-Fantulin of the Ruđer Bošković Institute then described analysis processes for recommendation systems in RapidMiner. They presented ready-to-go workflows which can simply be used or easily adapted to one's own situation, and I am sure that many users will find those really useful. The templates are available on myExperiment with the RapidMiner Community Extension. They also pointed out the currently running data analysis challenge on TunedIT for recommender systems. Still 11 days left – you should consider participating! Besides the 5500 Euro prize money, Rapid-I is also sponsoring a free trip to next year's RCOMM 2012 for the best RapidMiner process!
I unfortunately missed the next session since I had a business event in parallel. Benjamin Schowe of the TU Dortmund presented his work on feature selection methods in RapidMiner and, afterwards, Marcin Blachnik of the University of Bielsko-Biala presented a new Extension for instance selection. I was also really interested in the next talk about Radoop, a combination of RapidMiner with the map & reduce framework Hadoop, but for the second time already I missed the talk of Zoltan Prekopcsak of the Budapest University of Technology and Economics. Nothing personal, Zoltan, and I am sure that we will collaborate in the future anyway!
The next session covered various applications of RapidMiner and RapidAnalytics. I was particularly excited to see a first contribution that already uses the new RapidMiner server RapidAnalytics: Gábor Nagy of the Budapest University of Technology and Economics introduced a stock price prediction system based on RapidAnalytics. Afterwards, Simon Jupp of the University of Manchester presented a combination of RapidMiner with Taverna, a web-service-based workflow system offering lots of functionality for bioinformatics. The final talk in this session was given by Milan Vukicevic of the University of Belgrade about the classification of electricity customers with WhiBo decision trees.
The next session was divided into two parts: first, I gave a workshop about some basics of loop and macro usage. This was sort of a preparation for this year's game show "Who wants to be a Data Miner?". There were three tasks (hobbit genealogy, drawing a spiral, and distinguishing between vodka and presidents). I won the first one myself (yeah!), the second task was solved by Benjamin, and Matko defended his title from 2010 in the third one. Thanks to all participants and congratulations to Benjamin and Matko!
Tomorrow I will add another post describing the second day of the conference and the certification exam. Stay tuned!
We again got high-quality submissions from people worldwide and I am really looking forward to meeting the authors and the other community members for fruitful discussions and exchanging ideas.
Another important date is approaching: the early bird rate is only available until April 29th (this Friday!). Hence, you should register soon for the conference and make use of the up to 30% early-bird discount.
During RCOMM 2010, Milan Vukicevic gave a talk about WhiBo, which is like having a mini-RapidMiner within RapidMiner. They divided well-known algorithms like decision trees into their components, which can now be combined almost arbitrarily. This allows for the easy reconstruction of already known algorithms (like the many different decision tree or k-means variants) but also simplifies the discovery of new ones.
Now the authors have ported the WhiBo extension to RapidMiner 5 and published a first video showing how to use the extension:
You can find more information about the WhiBo extension and how you can install it at