Imagine that all you had to do to create a data mining process was select a data set and specify what you want to do with the data, e.g. predictive modelling. Wouldn't that save a lot of work?
Within the research project "e-LICO", funded by the EU within the 7th Framework Programme, the Intelligent Discovery Assistant (IDA) was developed, and it does precisely that. It comes with its own perspective (marked with the silhouette of a friendly butler) that contains all you need: the repository and the assistant itself. To use it, follow three simple steps:
Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
Select a goal. The most frequent one is probably "Predictive Modelling". All goals have comments, so you see what they can be used for.
Select "Fetch plans" and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.
The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations: for example, it automatically replaces missing values if they exist and the learning algorithm requires complete data, or it performs a normalization when using a distance-based learner.
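To make the idea more concrete, here is a minimal Python sketch of the two checks mentioned above. It is not the IDA's actual planning logic (that runs in Prolog); the function names, the step names, and the choice of mean imputation and range normalization are our own assumptions for illustration.

```python
def plan_preprocessing(rows, needs_complete_data, distance_based):
    """Decide which (hypothetical) preprocessing steps go before the learner."""
    steps = []
    has_missing = any(value is None for row in rows for value in row)
    # Only repair missing values if they exist AND the learner cannot handle them.
    if has_missing and needs_complete_data:
        steps.append("replace_missing_values")
    # Distance-based learners are sensitive to attribute scales.
    if distance_based:
        steps.append("normalize")
    return steps


def replace_missing_with_mean(rows):
    """Fill each missing entry (None) with the mean of its column."""
    means = []
    for column in zip(*rows):
        present = [value for value in column if value is not None]
        means.append(sum(present) / len(present))
    return [
        [means[i] if value is None else value for i, value in enumerate(row)]
        for row in rows
    ]


def normalize_to_unit_range(rows):
    """Rescale every column to [0, 1] so no attribute dominates a distance."""
    bounds = [(min(column), max(column)) for column in zip(*rows)]
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, (low, high) in zip(row, bounds)]
        for row in rows
    ]


# Made-up toy data: two numerical attributes, one missing value.
rows = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
plan = plan_preprocessing(rows, needs_complete_data=True, distance_based=True)
filled = replace_missing_with_mean(rows)
scaled = normalize_to_unit_range(filled)
print(plan)
print(scaled)
```

For a learner that is neither distance-based nor sensitive to missing values, the same function returns an empty plan, which mirrors the point that preprocessing is only added when data and learner require it.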
You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL. Alternatively, just download it directly and place it in RapidMiner's lib\plugins folder.
Since the workflow planning happens in Prolog, this extension automatically installs a Prolog engine (XSB Prolog plus Flora 2) when it first starts. These can only be installed into a specific directory, so you must run RapidMiner as administrator when using the extension for the first time. (On Windows, right-click and select "Run as administrator".)
If you try out the extension, we ask you to participate in the user survey so we can keep improving the extension. You can easily open the survey by installing the extension and clicking on the third button in the toolbar (the one with the letter box).
The IDA was developed as a collaboration mainly between the University of Zurich (Jörg-Uwe Kietz and Floarea Serban) and Rapid-I.
Great news for those of you who are waiting for an official RapidMiner book: we recently made some progress on the long-lost manual, and below you can even find something new: more information and a call for chapters for the upcoming book about how to use RapidMiner in different application areas.
Editors:
Dr. Markus Hofmann, Institute of Technology Blanchardstown, Ireland
Ralf Klinkenberg, Chief Business Development Officer, Rapid-I, Germany
RapidMiner has, without a doubt, a serious impact on software choice when it comes to data mining and predictive analytics. Thanks to its open source license model, RapidMiner spread quickly and is now deployed by hundreds of thousands of users in more than 60 countries worldwide. It is often referenced as a true competitor to proprietary commercial solutions. However, as with many other open source solutions, a lack of application-oriented documentation is often a barrier to using the software. The proposed book aims to address this issue and lower this barrier by demonstrating how to apply RapidMiner in many relevant areas.
The proposed book will be an introductory book to RapidMiner focusing on use cases to explain the functionality and the most frequently used operators. The aim is not to produce another data mining book; a certain knowledge of data analysis concepts and techniques can be assumed when drafting chapter proposals.
The book will provide high-quality practical articles on use cases that showcase RapidMiner as a leading data mining software. Each use case has to be accompanied by a dataset. While reading the chapter, the learner can follow and implement the use case in RapidMiner 5.
Recommended Topics and Themes
Original papers on all aspects of data analysis that RapidMiner caters for are invited. Submissions must not duplicate work that any of the authors has published elsewhere or submitted in parallel to any other books, conferences or workshops with proceedings. In addition, it is not always necessary to produce the best possible mining process on the data. Instead, the aim is to use the data to explain a set of operators in a practical manner (step by step process).
Possible topics covering all aspects of data mining may include (but are not limited to):
Researchers and practitioners are invited to submit on or before December 31, 2011, a 2 to 3 page manuscript proposal clearly explaining the use case of the proposed chapter and the operators that will be introduced. Authors of accepted proposals will be notified by January 31, 2012. The following should be kept in mind:
The proposed project should include a sample mining process.
You need to submit your Curriculum Vitae with the chapter proposal.
The data needs to be publicly available so that future readers of the book can reproduce the use cases.
The aim is not to produce the perfect process but to use and explain an appropriate number of operators.
Chapter proposals can be submitted as MS Word or PDF file.
Full chapters are expected to be submitted by May 31, 2012. All submitted chapters will be reviewed by at least two reviewers. Various publishing strategies and publishers are currently considered.
Important Dates
Manuscript proposal for book chapter (2-3 pages): December 31, 2011
Notification to authors of submitted chapters: January 31, 2012
First Draft of the chapters from authors: May 31, 2012
Reviews back to authors: June 30, 2012
Revised Chapters back from authors: July 31, 2012
Final notification to the authors: August 31, 2012
Final camera-ready chapters from authors: September 30, 2012
Department of Informatics, School of Engineering and Informatics
Institute of Technology Blanchardstown (ITB)
Blanchardstown Road North
Dublin 15
Ireland
As of RapidMiner 5.1.11, we have introduced a new kind of I/O-Object in RapidMiner: File Objects. File Objects are generated by opening local files, URLs, looping over directories or ZIP files, etc. They are then parsed by Read CSV or Read Excel and converted to an example set. All this was possible before, partly by using macros, but it is now much simpler and more flexible.
Foremost, however, it offers a new way of sending input data to a process exposed as a Web service in RapidAnalytics: The body of the HTTP POST request is transformed into such a File Object and can then be parsed as a part of the process. This makes the definition of the input format of a Web service very flexible and provides a simple means to create Web services that classify data tables.
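As a rough illustration of the idea (not RapidAnalytics code), the following Python sketch builds an HTTP POST request whose body is a small CSV table, and shows how a receiver could parse such a body back into rows, much as Read CSV would parse a File Object. The service URL, the attribute names, and the helper functions are hypothetical; the request is only constructed, not sent.

```python
import csv
import io
import urllib.request


def table_to_csv_bytes(header, rows):
    """Serialize a small data table to CSV bytes for use as a POST body."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def build_post_request(url, body):
    """Create (but do not send) a POST request carrying the CSV body."""
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/csv"},
        method="POST",
    )


def parse_body(body):
    """What the receiving side does conceptually: parse the raw body as CSV."""
    return list(csv.reader(io.StringIO(body.decode("utf-8"))))


# Hypothetical endpoint; replace with the URL of your own exposed process.
SERVICE_URL = "http://localhost:8080/hypothetical/classify-service"

body = table_to_csv_bytes(["a1", "a2"], [[5.1, 3.5], [4.9, 3.0]])
request = build_post_request(SERVICE_URL, body)
parsed = parse_body(request.data)
print(request.get_method(), parsed)
# Sending would be: urllib.request.urlopen(request)
```

Because the body is just a file-like payload, the same pattern would work for any format the process can read, which is what makes the input definition so flexible.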
One of the most fun events at the annual RapidMiner Community Meeting and Conference (RCOMM) is the live data mining process design competition "Who Wants to be a Data Miner?" In this competition, participants must design RapidMiner processes for a given goal within a few minutes. The tasks are related to data mining and data analysis, but are rather uncommon. In fact, most of the challenges ask for things RapidMiner was never supposed to do.
In 2010, we posted the winning processes immediately after the conference. This year we did not do so because the processes depend on input files which could not easily be attached to these processes on myExperiment. As of RapidMiner 5.1.11 we have a new way of handling files, making it easier to link RapidMiner processes against data files on the Web (more on this soon in this blog). Therefore, all data files are now uploaded to the Rapid-I webspace, and the processes are also on myExperiment, bundled in a pack.
The 2011 challenges were quite fun and dealt with Hobbits, Vodka, and our latest, brand new product: RapidDraw. The processes are quite instructive and are worth playing around with. With the RapidMiner Community Extension you can download the processes directly from myExperiment into RapidMiner (just search for RCOMM). Alternatively, view the pack description on myExperiment.
Summary: Perform simple aggregations on your data (like sums, min or max) and show those values with respect to defined groups
Number of Dimensions: 2, one for the grouping and one for the (aggregated) values
Data Types: Numerical, Nominal
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.
Before we start our discussion of the Bars plotter, let us again have a look first:
As you can see, the bar plotter consists of several bars representing values (on the y-axis) for selected groups (on the x-axis). In principle, the bar plotter is very similar to the plotters Pie, Pie 3D, and Ring which we have discussed in a previous blog post. The basic idea of this type of chart is to present a number of numerical values where each value represents a group. There are two typical application areas for this:
You have a data set with two columns, one column with a set of (un-)ordered nominal values and a second one containing a numerical value for each group;
You again have a data set with a nominal and a numerical column, but now you have each nominal value several times in your table. The goal then is to aggregate the numerical values for each group defined by each of the nominal labels.
It is important to see that in the first case, each nominal value only occurs once and hence there is no need for any calculation on the numerical values. In the second case, you usually would like to perform simple aggregations on your numerical data (like sums, min or max) or at least to calculate the count of your nominal values for each group. Hence, you would like to show those calculated / aggregated values with respect to the defined groups.
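The second case can be sketched in a few lines of Python. This is only an illustration of grouping and aggregating, not RapidMiner's implementation; the function name, the set of aggregation names, and the toy data are assumptions.

```python
from collections import defaultdict

# Aggregation functions, keyed by illustrative names.
AGGREGATIONS = {
    "sum": sum,
    "min": min,
    "max": max,
    "average": lambda values: sum(values) / len(values),
    "count": len,
}


def aggregate_by_group(rows, function="sum"):
    """Group (label, value) rows by label and aggregate each group's values."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    aggregate = AGGREGATIONS[function]
    return {label: aggregate(values) for label, values in groups.items()}


# Made-up data: each nominal value occurs several times (the second case).
rows = [("A", 1.5), ("A", 1.0), ("B", 4.5), ("B", 4.0), ("C", 6.0)]

print(aggregate_by_group(rows, "sum"))
print(aggregate_by_group(rows, "average"))
print(aggregate_by_group(rows, "count"))
```

In the first case, where each nominal value occurs only once, every group holds a single value, so sum, min, max and average all return that value unchanged.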
Each (calculated) number will be presented by a bar where the height of the bar corresponds to the absolute value. This differs from the Pie charts, where each slice represents the share the number contributes to the total sum. Look at the example above, where we used the famous Iris data set and where you can see the different average values for attribute "a3" with respect to the three groups defined by the labels / classes.
As always, you can find a list of settings on the left. The first setting is the Group-By Column. This will typically be a nominal-valued column from your data set which defines the groups into which the data set will be divided and which are presented by the elements of the chart. The setting Legend Column changes the labels at the bars to the values of the selected column. Since the only useful options are None or the grouping column, it can be ignored in most cases and will probably be removed in one of the next versions anyway.
The next important setting is the Value Column. Here you can select the (usually numerical) column which is used for value calculation. If you only have one row for each nominal value in the grouping column, you most often already have aggregated values ready for displaying. In other cases, you will have to define a matching Aggregation function, for example the sum or average of the values in each group. There are two additional settings which can be used to further fine-tune the plotting: Absolute Values means that only absolute values of the value column are used as input for the aggregation function. The setting Use Only Distinct means that each value is used exactly once in the aggregation, i.e. additional equal values are ignored.
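The effect of the two fine-tuning settings is easiest to see on a small example. The following Python sketch mimics them for a single group; the function name and parameters are hypothetical, and the order in which the two options are applied (absolute values first, then duplicate removal) is an assumption, not something taken from the plotter's source.

```python
def prepare_values(values, absolute=False, distinct=False):
    """Apply the two optional settings before the aggregation function runs."""
    if absolute:
        # "Absolute Values": feed |v| instead of v into the aggregation.
        values = [abs(value) for value in values]
    if distinct:
        # "Use Only Distinct": keep the first occurrence of each value only.
        seen = set()
        unique = []
        for value in values:
            if value not in seen:
                seen.add(value)
                unique.append(value)
        values = unique
    return values


# Made-up values of one group in the value column.
group = [-2, 2, 3, 3]

print(sum(prepare_values(group)))
print(sum(prepare_values(group, absolute=True)))
print(sum(prepare_values(group, distinct=True)))
print(sum(prepare_values(group, absolute=True, distinct=True)))
```

Note that with both options switched on, -2 and 2 collapse into a single value under the assumed order, so the result can differ from applying either option alone.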
The next setting lets you rotate the labels on the x-axis by 90 degrees, which makes longer labels readable and prevents labels from overlapping when there are many groups. Finally, you can define the orientation of the bar plotter, i.e. whether the bars should be displayed vertically (default) or horizontally.
Neil's new video series will again help a lot of RapidMiner users, beginners as well as more experienced analysts, since it covers one of the most important aspects of data analysis: loading and transforming the data into a format which is most suitable for analysis. It is widely known that this type of data preparation easily takes up to 90% of the total effort you put into analysis.
Neil has again produced four really great videos covering some of the most important aspects of ETL (Extract, Transform, and Load) and how this can be done with RapidMiner. Here we go:
Here is the first video; please find the rest in Neil's blog (see links above):
Please visit the Vancouver Data Blog for more information and please leave some comments for Neil - he will appreciate your thanks!
We are sure that we speak for the many users out there when we thank you, Neil, for putting these efforts into producing those videos - they are certainly helping a lot!
Summary: Perform simple aggregations on your data (like sums, min or max) and show those values with respect to defined groups
Number of Dimensions: 2, one for the grouping and one for the (aggregated) values
Data Types: Numerical, Nominal
This is the next post of a series describing all RapidMiner plotters in detail. A list of the plotters discussed so far, including links to them, can be found at the end of this article. Since many options and controls of these plotters are also relevant for the one discussed here - as well as for many other plotters - I recommend checking out the first parts of this series before reading this one.
Before we start our discussion of the plotters Pie, Pie 3D, and Ring, let us again have a look first:
The three plotters Pie, Pie 3D, and Ring are very similar to each other. We will demonstrate all plotter functions with the Pie chart and show screenshots for the other two plotters later on. The basic idea of this type of chart - which also includes Bar charts, discussed in the next part of the series - is to present a number of numerical values where each value represents a group. There are two typical application areas for this:
You have a data set with two columns, one column with a set of (un-)ordered nominal values and a second one containing a numerical value for each group;
You again have a data set with a nominal and a numerical column, but now you have each nominal value several times in your table. The goal then is to aggregate the numerical values for each group defined by each of the nominal labels.
It is important to see that in the first case, each nominal value only occurs once and hence there is no need for any calculation on the numerical values. In the second case, you usually would like to perform simple aggregations on your numerical data (like sums, min or max) or at least to calculate the count of your nominal values for each group. Hence, you would like to show those calculated / aggregated values with respect to the defined groups.
The charts Pie, Pie 3D, and Ring differ from almost all other types of charts: there is no background and there are no scales involved. Instead, each (calculated) number will be presented by a slice of the pie where the area of the slice corresponds to the share the number contributes to the total sum. Look at the example above, where we used the famous Iris data set and where you can see the different average values for attribute "a3" with respect to the three groups defined by the labels / classes.
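How an aggregated value turns into a slice can be written down in a few lines. This Python sketch is only an illustration under our own assumptions (function name, made-up values, and the restriction to non-negative numbers); it is not taken from the plotter's code.

```python
def pie_slices(group_values):
    """Map each group's value to (share of total, slice angle in degrees)."""
    values = group_values.values()
    total = sum(values)
    # A pie only makes sense for non-negative values with a positive total.
    if total <= 0 or any(value < 0 for value in values):
        raise ValueError("pie slices need non-negative values and a positive sum")
    return {
        label: (value / total, 360.0 * value / total)
        for label, value in group_values.items()
    }


# Made-up aggregated values for three groups.
slices = pie_slices({"A": 1.0, "B": 3.0, "C": 4.0})
for label, (share, angle) in slices.items():
    print(label, share, angle)
```

A bar chart would plot the values 1.0, 3.0 and 4.0 directly, whereas the pie only shows that group C accounts for half of the total.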
As always, you can find a list of settings on the left. The first setting is the Group-By Column. This will typically be a nominal-valued column from your data set which defines the groups into which the data set will be divided and which are presented by the elements of the chart. The setting Legend Column changes the labels at the slices to the values of the selected column. Since the only useful options are None or the grouping column, it can be ignored in most cases and will probably be removed in one of the next versions anyway.
The next important setting is the Value Column. Here you can select the (usually numerical) column which is used for value calculation. If you only have one row for each nominal value in the grouping column, you most often already have aggregated values ready for displaying. In other cases, you will have to define a matching Aggregation function, for example the sum or average of the values in each group. There are two additional settings which can be used to further fine-tune the plotting: Absolute Values means that only absolute values of the value column are used as input for the aggregation function. The setting Use Only Distinct means that each value is used exactly once in the aggregation, i.e. additional equal values are ignored.
The last possible setting, which is only available for Pie and Ring but not for Pie 3D, is the definition of so-called Explosion Groups. Here you can select one or several of the possible groups and move them away from the rest with the slider Explosion Amount. This can help to highlight selected groups as shown in the first picture above.
It was quite calm in the Rapid-I blog during the last weeks, sorry for that... It's vacation time and those of us who have to stay are quite busy these days.
In the meantime, you might be interested in an interview given by Simon and myself to Ajay Ohri of DecisionStats. We are talking about the new Rapid-I marketplace and new extensions, big data analytics, Hadoop, and mobile computing for business analytics.
Those of you who visited RCOMM 2011 already know about Radoop, the powerful combination of RapidMiner with Hadoop. This makes big data analytics easier than ever. I missed the talk myself (shame on me!) but we had a lot of fruitful discussions afterwards, and from my point of view this will become the next RapidMiner revolution. Below you will find some information about the project.
What is Hadoop?
Hadoop is a software framework that supports data-intensive distributed applications. It is based on Google's now well-known map & reduce paradigm, which makes it an excellent tool for analyzing large data sets. In principle, Hadoop is able to work with thousands of computing nodes on petabytes of data.
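For readers who have not seen map & reduce before, here is a minimal single-machine Python sketch of the paradigm using the classic word count. It does not use Hadoop or any of its APIs; the function names and the example documents are our own, and on a real cluster the map and reduce calls would be distributed over many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: collect all values that belong to the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after another in a single process."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data", "Big clusters"])
print(counts)
```

Since each map call only looks at one document and each reduce call only at one key, both phases can be spread over many machines, which is where the scalability comes from.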
What about Hive and Mahout?
Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses the distributed file system of Hadoop and the efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouse.
Mahout is a machine learning library that already offers many scalable machine learning algorithms, likewise implemented on top of Hadoop and its map & reduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks making use of the power of Hadoop.
You will see below that both frameworks will be tightly integrated with RapidMiner.
What can RapidMiner bring into the game?
Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst performs some nasty tricks, the data size is limited by the memory available. So we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.
RapidMiner + Hadoop = Radoop
Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
Here is the presentation Zoltán Prekopcsák gave at RCOMM 2011:
Right now, a restricted beta phase has started and you can apply for it at http://radoop.eu/. More information about Radoop can be found at http://blog.radoop.eu/.
The second day started with another invited talk: Matthias Reif of the Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) spoke about Towards Next-Generation Data Mining. Matthias presented very interesting insights into new trends in data analysis, including the data-driven recommendation of classifiers and the prediction of classifier accuracy and resource consumption. He depicted the integration of those techniques into server-based solutions like RapidAnalytics, which will be the next step towards collaborative data analysis in the cloud. Very fascinating! Matej Mertik of the Faculty of Information Studies in Novo mesto then presented an application of RapidMiner in the medical domain. I must admit that I did not fully get the connection between feature selection and the presented game of life approach, but I am sure that we will get a chance to sort these things out later on. The session was concluded by Andrew Chisholm of the ITB with a talk about possibilities of cluster evaluation. This was really a great talk: within 30 minutes Andrew perfectly explained his route through the pitfalls around unsupervised data analysis on a real-world problem. Andrew is an experienced speaker and told a great story with many nice ideas behind it. It was really a pleasure to listen to him.
The second session on this day covered new Extensions for RapidMiner and RapidAnalytics. In the first talk, Radim Burget of the Brno University of Technology discussed their new Image Mining Extension, which is already available on our marketplace (see below). It looks great and I will certainly give it a try soon! Afterwards, Milos Jovanovic of the University of Belgrade presented a combination of their WhiBo toolkit, presented last year, with a genetic programming approach. The result is an optimized decision tree composed of the single steps and sub-algorithms known from different decision trees and their implementations. This is pretty close to some ideas of my master's and PhD theses, so I very much liked this idea (go guys and make it multi-objective next!) ;-)
Simon Fischer presented new Extensions and the Rapid-I Marketplace.
Simon Fischer of Rapid-I concluded this session with an overview of upcoming RapidMiner Extensions and new features which will be released during the next weeks and months, including the new operator recommender. Simon also presented the new marketplace (http://marketplace.rapid-i.com), which serves as a central store for RapidMiner Extensions and analytical algorithms. Simon then showed some of the business analytics features of RapidAnalytics, namely the pixel-precise report designer and the integration of analytical results into interactive web-based reports.
The last session on the second day covered text and web mining. Felix Jungermann of the TU Dortmund presented new techniques for handling tree structures in RapidMiner. He showcased these extensions for information extraction and relation detection. Bruno Ohana of the ITB in Dublin then presented a hot topic right now: sentiment analysis and opinion mining, of course done with RapidMiner. This was an interesting talk, and a comparison to other approaches demonstrated the very high quality of the results. The last talk of the conference, by Clemens Forster of the Vienna University of Economics and Business, also covered sentiment analysis, this time in customer feedback.
Live music in the Temple Bar.
We made a trip to the Temple Bar district afterwards and visited some of the most famous pubs in Dublin. I tasted strawberry beer (well, interesting…) and listened to good music. It was almost a miracle that more than 20 participants managed to take the certification exam on Friday after this evening ;-)