Big Data Analytics with RapidMiner and Radoop: An Interview with Dr. Ingo Mierswa and Zoltán Prekopcsák
1. What was the starting point for the partnership between Radoop and Rapid-I?
Zoltan: We have been long-time users of RapidMiner, and we had great conversations with Rapid-I people at business intelligence conferences and at the RCOMM conferences. Both teams have an academic background and we share an enthusiasm for data technologies, so it was very easy to get along. When we developed the first version of Radoop, we started to communicate more and more. We had many ideas about how we could help each other, so we decided to formalize our partnership.
Ingo: I met Zoltan for the first time at our annual user conference, RCOMM, which in 2010 took place in Dortmund near the Rapid-I headquarters. Zoltan presented his work on Big Data Analytics. Although RapidMiner offers several different solutions for working on really large data sets, Zoltan and his Radoop team accepted the challenge of developing an even better solution based on the latest developments in the field of big data. The results were and still are very impressive, and I am personally very excited about this collaboration.
2. One of the first results of this collaboration is the RapidMiner Extension named Radoop. What does this Extension offer to its users?
Ingo: Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. Radoop closely integrates the highly optimized data analytics capabilities of Hive and Mahout into the user-friendly interface of RapidMiner. This results in a powerful and easy-to-use data analytics solution for Hadoop.
Zoltan: Radoop allows RapidMiner users to access and analyze big data stored in Hadoop clusters. Now it is possible to analyze even terabytes or petabytes of data from the same intuitive interface. You can design ETL and data mining processes that run on the Hadoop cluster, and you can visualize data samples in RapidMiner. Radoop virtually eliminates the memory limit for RapidMiner and allows it to scale to very large data sets. We have used RapidMiner for many of our data mining projects. It was very easy to use, but some of our projects included very large databases that RapidMiner could not handle. We started to use complex distributed technologies like Hadoop, but they proved to be really hard to work with. We wanted to fill this gap with Radoop and provide the power of distributed systems and an easy-to-use interface at the same time.
3. What impact will this product have on Big Data Analytics?
Ingo: While others talk about big data and how to overcome the related issues, we are happy to announce a solution already for the easy creation of data transformations and analytical processes based on Hadoop. This makes RapidMiner + Radoop the first enterprise-ready solution for Big Data Analytics based on Hadoop worldwide. Most current initiatives target the infrastructure level of Hadoop. Radoop does not: it aims to support analysts in their everyday work without the need for any coding.
Zoltan: I agree. Big data tools today are extremely complex, must be set up manually, and require programming skills. Therefore, experts are scarce. Radoop and RapidMiner make it much easier to analyze big data with their graphical drag-and-drop interface for defining big data analytics workflows. Many companies already have a couple of big data experts, but they are the only ones who can access and analyze the data. Radoop will open up the possibility for many more analysts, even for non-technical people.
Ingo: This is also an opportunity for companies not yet analyzing big data. Because of the simplicity of Radoop, they can take their first steps in this field without hiring experts specifically for this task. Although the impact might be largest for big data novices, Radoop offers so many features and shortcuts for typical tasks that it will also make experts much more productive.
4. How, in general, do you think the Big Data topic will evolve?
Ingo: Market demand is currently shifting from descriptive analysis based on traditional methods like OLAP to predictive or even prescriptive analysis. Instead of answering questions like “What happened?” people can expect answers to questions like “What will happen?” or “What is my best option now?” The techniques from these fields require a deep knowledge of methods from statistics AND computer science. As Gartner has recently pointed out, the lack of experts in this field is a large bottleneck that holds companies back from embracing these new methods and the software tools offering them. Radoop, as the first full solution that simplifies advanced analytics based on Hadoop, is going to change this completely.
Zoltan: There is huge hype around big data, and we will see many successes as well as many failures in the coming years. Analysts need to be careful not to sacrifice quality for quantity. More data is not necessarily better. People will need to understand that big data tools like Hadoop only provide the infrastructure, and that they still need to work out the best use-cases for their business. Big Data has huge potential, but it is not a magic wand that solves every problem.
Ingo: Having worked as a data analyst with techniques from data mining and text mining for almost 15 years now, I do not see Big Data as a new trend. Big Data without Analytics does not help at all. The analytical results are what matter for identifying new business opportunities or threats in advance. Hence, I am of course very happy that the need for analytics is now widely recognized as one of the most important topics for future IT.
5. What do you think are the big challenges for Big Data?
Zoltan: I think the two most important challenges are the complexity of the current tools and the lack of people who can operate them. There is a major shortage of talent, both analytical experts and data-savvy managers, which makes it harder to succeed with big data projects. I believe that Radoop and RapidMiner do a good job in reducing the complexities associated with big data, so more people can access and analyze large data sets now.
Ingo: Another important aspect of Big Data is the change from structured data to semi-structured, poly-structured, and even completely unstructured data. The data is no longer part of a data warehouse with a structured data mart but is distributed over several places, sometimes no longer even in a tabular format. Unstructured data like text collections poses a specific challenge for big data analytics.
6. You mentioned advanced analytics besides large data volumes and poly-structured data. What can RapidMiner already offer to its users?
Ingo: RapidMiner is today the most feature-rich solution for advanced analytics on the market. Talking about text data: we offer seven different flavors of Support Vector Machines alone, which are especially powerful for text classification tasks. Most other solutions on the market do not even offer a single version of this powerful learning technique. The same is true for other methods: in total, RapidMiner offers more than 250 methods for data modeling and hundreds of operations for data transformation. And Radoop now adds new operations for accessing data from Hadoop and using Hadoop clusters for calculations and data transformations.
7. Ok, back to Radoop. What is the major advantage of Radoop compared to other tools for Big Data Analytics?
Zoltan: I think that one of the most important advantages is being tightly integrated with the world-leading data mining tool. RapidMiner has a clean and intuitive interface and a data flow philosophy that we have successfully extended for big data. The tight integration of RapidMiner and Radoop allows users to run distributed and in-memory analyses even in the same process, with the same interface. This is a very powerful package that is not offered by anybody else.
8. Why should companies give this combination a try?
Zoltan: For RapidMiner users, Radoop is very quick to learn and start using. It is a natural way for them to access and analyze their larger data sets. For others, the combination of Radoop and RapidMiner is a full solution for all data sizes and all types of data analytics problems.
Ingo: Exactly! Data sizes are no longer a bottleneck with this solution. Companies that want a fully fledged solution for data integration, transformation, and analysis now get all this in one easy-to-use interface, even for the largest possible data sets.
9. In which sectors or verticals will Radoop show the biggest advantages?
Zoltan: Radoop has the biggest advantages where large data sets are common. We see tremendous data growth at web companies such as social networks and social games, and websites with millions of visitors also have problems storing and analyzing their customers’ behavior. They need a scalable solution as they grow rapidly, so the virtually unlimited scalability of Radoop is very appealing to them.
Ingo: We see interest from many companies in the finance sector, where huge amounts of historical data are available and can be used to improve future performance, especially through better models for credit scoring or churn prevention. We have significant interest from the medical and healthcare sectors and also from telecommunications and retail. Many of these sectors have had large data sets for many years now; they just need a simple tool to use them to their advantage.
10. Zoltan, you showcased Radoop for the first time at RCOMM 2011. How has it evolved since then?
Zoltan: When we presented Radoop at RCOMM in June 2011, it was a technological proof of concept. We showed that we could integrate RapidMiner with Hadoop while keeping their main advantages, and we wanted to see whether people would be interested. We received lots of feedback, and since then we have concentrated on testing and stabilizing it and making it ready for the most common business use-cases. We have added many new features that our beta testers had been missing, and we have improved compatibility with RapidMiner itself and with the various Hadoop versions that are available.
11. What is planned for the future? Any improvements in mind?
Zoltan: Radoop is still in private beta, and we are targeting a public release in Q2 of 2012. We will add more features for predictive analytics on big data, more compatibility with external systems, and a groundbreaking new feature that I cannot disclose for now. It will likely change the way companies think about their big data infrastructure.
Ingo: Rapid-I will continue to add new analytical algorithms to RapidMiner and will also keep to its path of offering the most flexible and powerful, but at the same time easy-to-use, solution. The next major release of RapidMiner will, for example, include new internal data handling and support for parallel stream handling. These improvements will also directly benefit the users of Radoop.
12. Is the Rapid-I community involved in this development?
Ingo: Many members of the RapidMiner community have helped with testing Radoop and providing feedback on the extension. These insights are very important for the developers and help to improve the software and make it more robust across the large number of scenarios in which Radoop has already been used.
Zoltan: Radoop is not a community project, but we contribute patches for the open-source tools that we use, including of course RapidMiner and Hadoop. It is very important for Radoop to have a good connection with the community, and we will be co-organizing the next RCOMM conference in Budapest at the end of this summer.
13. What are your future plans for your collaboration in the Big Data field?
Zoltan: One of our short-term goals is to integrate Radoop with the server version of RapidMiner, called RapidAnalytics. Radoop and RapidAnalytics together would allow the scheduling of big data processes and collaboration between analysts. That will again be a really powerful combination, which can also deliver the results of Big Data Analytics to end users in reports or dashboards via the server’s web interface.