27 Nov 2012
Radoop Team wins hack/reduce hackathon in Boston
by Giuseppe Taibi
hack/reduce brands itself as Boston's Big Data hacking space. Backed by a who's who of Boston tech powerhouses, ranging from Harvard and MIT to Google and Microsoft to the State of Massachusetts and top-tier VCs, hack/reduce is located in the historic Kendall Boiler and Tank building that gives its name to the vibrant Kendall Square technology district, brimming with startup excitement.
True to its mission of "helping Boston create the talent and the technologies that will shape our future in a big data-driven economy," hack/reduce organized its first hackathon on Nov. 17. We at Rapid-I love Big Data, so this was a terrific opportunity to mingle with the Boston Big Data community. RapidMiner, Rapid-I's popular open-source visual environment for data analysis, can easily work on Big Data via Radoop, a RapidMiner extension that adds all the necessary operators to the standard set, so working on Big Data is as easy as drag-and-drop, no coding required. In addition to supporting Map/Reduce, Radoop includes a number of machine learning operators based on the powerful Mahout open-source library. Mahout is known for being powerful, yet hard to use. Thanks to Radoop, working with Mahout is a breeze.
The day began with a tutorial on Hadoop by Greg Lu, a Software Engineer at hopper and the Technical Director of hack/reduce. Then teams were formed. The response to our "Big Data hacking without coding" pitch was terrific, and our team quickly grew from four to over 20 members. We used Skype to keep everybody on the same page and to troubleshoot. That worked great, especially since we had the original developers of Radoop online from Budapest, Hungary. We turned on the video chat, and it really felt like the remote team was with us in Cambridge.
The hackathon was great. At some point we had 25 people on Skype. The Radoop team from Hungary supported us during the entire 10 hours of the hackathon. At first, using a visual environment for a hackathon may sound counterintuitive, but in reality our teammates were really happy to be able to work at a higher conceptual level without having to wrestle with capricious code statements. In fact, our Radoop team was bound only by the power of the Hadoop cluster that we were working on. Because of the ease of use of Radoop, everybody was able to experiment with the data sets and the Hadoop cluster. As a result, the cluster came under stress and slowed down while trying to keep up with the number of job requests. The hackathon also helped the Radoop development team uncover a bug that slowed down the processing of a clustering algorithm. (The bug is now fixed.)
Our team worked on a 25GB dating profiles database provided by Mate1.com. Other available databases included carbon dioxide measurements, an Amazon.com product database, stock market prices, Wikipedia and more (the full list of datasets is available on the hackathon wiki). We were interested in performing cluster analysis to explore the similarities among user profiles. The Mate1 user profile attributes included age, gender, eye color, smoking habits, dating preferences, astrological sign, physical fitness, political views and many others.
For this task, we applied a K-Means clustering operator to the dataset, then used RapidMiner to create a scatter matrix plot to explore how the profile attributes related to each other. We found that most of the members had only filled out the minimum number of fields on the profile. Also, for whatever reason, people with the same eye color tended to identify with the same body type. In almost every comparison we noticed that many people chose not to specify a value for an attribute. People definitely tend to enter the minimum information necessary to create a profile and start browsing other people's profiles. One frustration was that the data set was normalized, so we did not know the exact meaning of a given attribute value. Towards the end we started to reverse-engineer this by creating our own profile on the Mate1.com website, but then we ran out of time.
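For readers curious what the clustering step does under the hood, here is a minimal plain-Python sketch of K-Means on made-up two-attribute profiles (age and a hypothetical fitness score). This is an illustration only, not the Mahout implementation that Radoop runs and not the Mate1 data:

```python
def kmeans(points, k, iters=100):
    """Minimal K-Means on tuples of numbers, with a crude deterministic init."""
    # spread the initial centroids across the input order (assumes k >= 2)
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # update step: each centroid moves to the mean of its cluster
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

# hypothetical profiles: (age, fitness score) - two obvious groups
profiles = [(18, 1), (19, 2), (20, 1), (21, 2),
            (58, 8), (60, 9), (62, 8), (61, 9)]
centroids, clusters = kmeans(profiles, k=2)
```

In RapidMiner the scatter matrix plot then does the visual exploration; here one would simply inspect `centroids` and `clusters`.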
We also conducted an analysis to verify the "Half Your Age Plus Seven Rule," which describes the age difference between partners that is considered socially acceptable. More specifically, we mined the dating database to answer the question "What is the oldest / youngest person that you are willing to date?" In a very entertaining presentation, one team member exposed the harsh fact that for Gender "2," the rule holds generally true, while for Gender "3," there is a big difference, in the form of members in their 20s and 30s willing to date partners much older than the rule allows. The database provided did not specify a text label for the gender, only a number, so feel free to guess which is which.
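The rule itself is simple arithmetic; as a quick sketch (illustrative only, not our actual RapidMiner process), the bounds we checked the data against look like this:

```python
def youngest_acceptable(age):
    # "half your age plus seven": the youngest partner the rule allows
    return age / 2 + 7

def oldest_acceptable(age):
    # the rule inverted: you must be at least half their age plus seven,
    # so the oldest acceptable partner is 2 * (age - 7)
    return 2 * (age - 7)

# a 30-year-old may date partners between 22 and 46 under the rule
assert youngest_acceptable(30) == 22
assert oldest_acceptable(30) == 46
```

The entertaining finding above amounts to many reported "oldest person" answers landing well above `oldest_acceptable` for one of the gender codes.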
The main sponsor of the hackathon was hopper, a startup focused on redefining travel using Big Data, which is among the founders of hack/reduce.
Other teams also presented interesting work, ranging from a cool iPad app made by Praveen Aravamudham with a spinning earth globe mapping CO2 emissions around the world, to an analysis of the most used words in Wikipedia ("United States" is the most used word).
Right after the teams' final presentations, all hackathon participants were given the opportunity to vote for the team that they thought had produced the most interesting work. The Radoop team was off to a great start in the polls and led the race all the way until Andree Coude, VP Technology at hopper, declared the voting closed and the Radoop team the winner.
Now we are figuring out how to make the best use of the award of $1,000/month of computing power at SoftLayer. Stay tuned.
The video of the final presentation is available at: http://www.ustream.tv/recorded/27101415
Boston Team Members:
Sheamus McGovern - CTO, Capital Market Exchange and machine learning blogger
Todd Cioffi - Director, Technical Training, Navis Learning
Joe Rothermich - Data Scientist and Co-Founder, PeopleHedge
Dan Gerlanc - Predictive Analytics and Visualization Consultant and Founder, Enplus Advisors
Daniel Colonnese - WebSphere Managing Consultant, Lighthouse Computer Services
Sridhar Alla - CTO, eIQnetworks
Kleber Gallardo - CEO, Alivia Technology
Giuseppe Taibi - CEO, Rapid-I North America
Budapest Team Members:
Zoltán Prekopcsák - CEO and Co-Founder, Radoop
Péter Hellinger - Senior Software Engineer and Co-Founder, Radoop
Gabor Makrai - Chief developer and Co-Founder, Radoop
Team Radoop hacking away
Radoop Process Using Mahout K-Means Clustering Operator
hack/reduce Hackathon Voting Results
Radoop Team - hack/reduce Hackathon
The Rapid-I team keeps on mining, and we have excavated two great books for our users. The first one, Data Mining for the Masses by Matthew North, is a very practical book for beginners and intermediate data miners, whereas The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman provides deep insight into the mathematical models driving the heart of every data analysis. It is not really hot off the press, but it has not lost its glamour since the release of the first edition a couple of years ago.
The book is targeted at readers with a statistical, mathematical or informatics background who want to understand not only how to use an operator in RapidMiner, but also why it works. The reader should not be afraid of mathematical formulas, but will be rewarded with a decent understanding of many methods implemented in RapidMiner and of the connections and inter-relationships between different learning algorithms: what do decision trees and rule learning algorithms have in common? Should you try an SVM if k-NN fails on your data? The Elements of Statistical Learning can be considered a standard book used in many data mining lectures around the world, which may be attributed to the fact that it does not just contain all the detailed information, but also presents it with relatively simple explanations - keeping in mind, of course, that understanding complex topics will always require a whole lot of effort. The book is downloadable from the author's website.
Data Mining for the Masses, on the other hand, takes a practical approach and, as the name implies, aims at a broader range of readers. Those of you who visited this year's RCOMM already had the opportunity to follow the presentation of the book by the author himself: Matt's comprehensive book gives a detailed and profound introduction to data mining. All major concepts of data mining are covered in a well-structured manner using real-life examples, most of which are solved completely with RapidMiner.
Actually, the book begins one step before the data analysis and explains the meaning of data mining itself, and it also does not leave out the ethical concerns the responsible data miner should keep in mind. Because of the accessible writing style and the good examples, this book is suited not only for IT professionals and college students who want to take a deeper look into data mining, but for anyone who wants to learn how to get the most out of their data.
This year's RCOMM live data mining challenge, "Who wants to be a data miner?", was a tricky yet fun task for the competitors. The task was to (partially) solve a Sudoku puzzle with RapidMiner. The solution shows that you can achieve virtually any data analysis task you can think of using only standard RapidMiner processes.
The input data set consisted of examples of the form (x,y,v) where x and y indicated the column and row inside the puzzle and v was the number predefined at this cell. The path to the solution was split into three subtasks:
- Task one was to generate the space of all combinations of cells and numbers that would be possible if there were no predefined numbers in the Sudoku. This task could be solved by starting with a simple data set containing only the numbers one to nine and applying two Cartesian Product operators to generate all combinations of these numbers for x, y, and v. In addition to that, we also generated a new attribute z, which indicates the 3x3 sub-table in which the cell (x,y) lies, using a Generate Attributes operator.
- This additional attribute z is useful for Subtask 2, which was to eliminate all combinations that are impossible given a single predefined cell value from the input data set. This was possible by using a combination of a Generate Attributes operator and a Filter Examples operator to identify those combinations (from the set of all combinations generated in Subtask 1) where v and at least one of x, y, or z match a number defined in the input.
- The resulting process of Subtask 2 could be re-used in Subtask 3 to eliminate all combinations whose impossibility could be inferred from looking at all predefined cell values in the input data set. This could be achieved by using a Loop Examples operator to iterate over all cells and using a nested Execute Process to re-use the process generated in Subtask 2.
Finally, by looking at those cells where only one possible number remains, we can identify a new value that can be inserted into the Sudoku with certainty. These cells can be identified by using an Aggregate operator to group the remaining possibilities by x, y, and z and finding those groups with a count of exactly one.
- As a bonus process, we can repeat the process from Subtask 3, iteratively Appending the inferred numbers to the predefined ones. Thus, the data set grows to a size of 81, which means the 9x9 Sudoku is complete. Finally, we use the Pivot operator and do some polishing to make the result look like this, a completely solved Sudoku:
All the processes mentioned above are accessible from myExperiment as a pack which you can use from within RapidMiner via the Community Extension. The processes can be opened directly in RapidMiner, but when saving, make sure to name them as they are named in myExperiment, since an Execute Process operator in a later process may expect them under those names. Process 0 in this pack downloads the initial data from the Web.
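Outside RapidMiner, the same elimination idea fits in a few lines of plain Python. This is a sketch of the logic described above, not the actual myExperiment processes; like the bonus process, it repeatedly fills any cell that has only one possible value left, so it handles puzzles solvable by that deduction alone:

```python
def block(x, y):
    # 3x3 sub-table index, playing the role of the generated attribute z
    return (x - 1) // 3 * 3 + (y - 1) // 3

def compatible(combo, given):
    # Subtask 2: is the combination (x, y, v) still possible given one known cell?
    (x, y, v), (gx, gy, gv) = combo, given
    if (x, y) == (gx, gy):
        return v == gv
    return not (v == gv and (x == gx or y == gy or block(x, y) == block(gx, gy)))

def solve(givens):
    """givens: a set of (column, row, value) triples, all in 1..9."""
    # Subtask 1: the space of all combinations of cells and numbers
    combos = {(x, y, v) for x in range(1, 10)
                        for y in range(1, 10)
                        for v in range(1, 10)}
    known = set(givens)
    while True:
        # Subtask 3: eliminate combinations ruled out by any known cell
        combos = {c for c in combos if all(compatible(c, g) for g in known)}
        # a cell with exactly one remaining value can be filled in with certainty
        new = set()
        for x in range(1, 10):
            for y in range(1, 10):
                vals = [v for (cx, cy, v) in combos if (cx, cy) == (x, y)]
                if len(vals) == 1:
                    new.add((x, y, vals[0]))
        if new <= known:  # no progress: return what we could infer
            return known
        known |= new
```

Each pass mirrors one iteration of the Loop Examples / Execute Process construction; the set comprehension plays the role of the Filter Examples operator.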
The third RapidMiner Community Meeting and Conference (RCOMM 2012) is quickly approaching, and we are very excited about a great program full of talks, success stories, and demonstrations. RCOMM 2012 will be held at the Budapest University of Technology and Economics (BME), Budapest, Hungary, from August 28 to 31, 2012.
The normal registration rate ends on August 13th, so we recommend registering now to take advantage of the discounted prices!
What to expect?
RCOMM 2012 offers more than 20 presentations, a social program, and our famous game show "Who wants to be a data miner?" The presentations include:
- Mining Machine 2 Machine Data (Katharina Morik, TU Dortmund University)
- Handling Big Data (Andras Benczur, MTA SZTAKI)
- Introduction of RapidAnalytics at Telenor (Telenor and United Consult)
- among many others.
Check the full program...
Presentations are aimed at researchers and practitioners using or extending RapidMiner for commercial or scientific use. Topics include analysis processes, use cases, success stories, best practice recommendations, and descriptions of software packages building upon or extending RapidMiner and RapidAnalytics.
Another important highlight of the conference will be the presentation of the new book "Data Mining for the Masses" by Matthew North from Washington & Jefferson College, which makes extensive use of RapidMiner.
Learn more about the full program...
RapidMiner Community Meeting and Conference (RCOMM 2012)
August 28 - 31, 2012
BME, Budapest, Hungary
Register now - last chance for discounted prices!
Looking forward to meeting you all in Budapest!
We at Rapid-I really like our work and do our best to provide you with a feature-rich data mining platform. And as you all know, of course, the Community Edition of RapidMiner is completely free of charge. Isn't that nice?
But today, we will need YOUR support!
On his really great data mining web site KDnuggets, Gregory asks his visitors once a year which data mining tools they have used within the last months. And here is where you come into the game: please vote for RapidMiner in the annual KDnuggets poll and help us become more widely known among analysts and researchers worldwide. This, in turn, will help to further improve RapidMiner, so you will actually get something back for only a small amount of your time.
Direct Link to the Poll at KDNuggets: http://www.kdnuggets.com/2012/05/new-poll-analytics-data-mining-software-used.html
Things are incredibly simple:
- Visit the web site KDnuggets: http://www.kdnuggets.com/2012/05/new-poll-analytics-data-mining-software-used.html
- Select RapidMiner and / or RapidAnalytics (in the poll box on the bottom right)
- Click on "Submit Vote" and confirm via mail
That's it! It's really easy and takes only a second... And please don't worry: Gregory will not use your email address for any purpose other than this confirmation.
Let me end this post and request with a big thank you for participating in this poll, as well as for the many comments and feature requests we have received over the last years. Things like that help us to improve RapidMiner. So help spread the word, so that we get more feedback in the future and can improve it further.
I just stumbled upon a very nice step-by-step introduction to RapidMiner written by Dr. Scott Turner, which has been published as a guest post on the blog The Number Crunching Life. Dr. Scott Turner won the Machine March Madness prediction contest last year and was the co-winner of the Sweet 16 contest two years ago. Check out his great blog all about algorithmic prediction of NCAA basketball.
So if you are learning to work with RapidMiner right now, or know somebody who has just started, this post might definitely be interesting to you:
Have fun reading this introduction!
We are happy to present you the third RapidMiner Community Meeting and Conference (RCOMM 2012)
The RCOMM 2012 will be held at the Budapest University of Technology and Economics in Budapest, Hungary
from August 28 to 31, 2012.
Last years' RCOMMs have been a great success, with lots of participants, many great talks, and workshops surrounding the conference. RCOMM 2012 intends to intensify the community life again and strengthen the RapidMiner network by bringing together users and developers of RapidMiner from all backgrounds. Presentations can be about applications (processes, use cases, best practice recommendations) or descriptions of software packages building upon or extending RapidMiner. More information at http://www.rcomm2012.com
Call for Papers
Presentations aim at researchers and practitioners using or extending RapidMiner for scientific or commercial use. Topics include analysis processes, use cases, success stories, best practice recommendations or descriptions of software packages building upon or extending RapidMiner. Learn more about how to submit a paper to RCOMM 2012 at http://www.rcomm2012.com
Hope to see you in Budapest and I am looking forward to your contributions.
This is probably the most exciting announcement of the last months: Radoop and RapidMiner are partners now! Read more below about this disruptive technology for Big Data Analytics.
You want Hadoop? You will love Radoop!
Hadoop has become a de facto standard for working with Big Data. The Hadoop framework supports data-intensive distributed applications, which makes it an excellent tool for analyzing large data sets. In principle, Hadoop is able to work with thousands of computing nodes on petabytes of data. The problem is: creating those data transformation and analysis jobs means scripting, coding, hacking - which is a real pain in terms of maintenance and integration.
Don't bother with this 90s-style of coding any longer! Radoop (learn more about Radoop in a previous blog post) offers the best of all worlds: the powerful yet flexible graphical user interface of RapidMiner together with the power of Hadoop. Radoop closely integrates the highly optimized data analytics capabilities of Hadoop clusters, the distributed data warehouse Hive, and Mahout into the user-friendly interface of RapidMiner. The result is a powerful and easy-to-use data analytics solution for Hadoop.
Everybody talks about Big Data now.
While others talk about big data and how to overcome the related issues, we are happy to already announce the solution for the easy creation of data transformations and analytical processes based on Hadoop. This makes RapidMiner + Radoop the first enterprise-ready solution for Big Data Analytics based on Hadoop worldwide.
Radoop is a disruptive technology. And it is the result of the hard work of two experienced teams around RapidMiner / RapidAnalytics (Rapid-I, Germany) and the Radoop extension (Radoop, Hungary). Radoop's active engagement has been one of the key factors for this revolution in Big Data Analytics. The people of Radoop are highly skilled and committed professionals; this is reflected in the amazing quality of the extension. As a consequence, we are really happy about this partnership and looking forward to even more exciting developments around Big Data Analytics during the next months.
More information in our official press release at http://rapid-i.com/content/view/358/1/
After quite some time of hard development, the Rapid-I team is proud to announce the birth of its latest baby: a brand new plot component presenting you a shiny, powerful and flexible visualization of your data and process results.
The new plotters support bar charts, area charts, scatter and series plots within a single configuration. Instead of preselecting a diagram type from a list of templates, the new plotters allow you to freely choose the visualization type of each attribute. You can plot more than one attribute at a time, create additional y-axes, combine aggregated bar charts with scatter plots, and add a number of error indicators if you feel the need. Enough talking - this is what the new plotters can do for you (of course, with your all-time favourite data set):
What do we see in this plot? As you might recognize, the points depict a scatter plot of two attributes of the Iris dataset, namely sepal length versus sepal width, where sepal length is placed on the domain axis (x-axis) and sepal width on the left range axis (y-axis). The colors and also the shapes of the points are chosen according to the label of the data point. This is also represented in the legend on the right.
Talking about the legend, you might want to have a closer look at it. The upper part reveals the plots in this diagram. The first entry, labelled sepal length (cm) with the circle in front of it, shows us that the plot consists of single data points, i.e. it is the scatter plot we just talked about. The missing color and rather undefined shape tell us to look at the bottom part of the legend to get the semantics for colors and shapes: moving our attention there, we discover that each unique color and shape represents one of the label values Iris setosa, Iris virginica and Iris versicolor.
Now all that is left to explain is the bar chart, which is also easily spotted in the legend: it is a histogram of Iris, grouped by label, over the sepal length. Note that the heights of the bars refer to a second range axis on the right.
The attentive reader will have noted that the bars are slightly transparent: this shows another feature of our new plotters - everything is formattable and customizable, starting at customizable presets and gradients for the plot colors, different shapes for each data series, plot and legend background up to the fonts of the title and the axes. What else do you desire? Bars oriented from left to right instead of vertical ones? No problem, two clicks and you are done. Aggregate your data to calculate averages and plot the standard deviation of each data point? No problem, everything is possible :)
The true plotter experts will even be able to beam good old Iris to New York and celebrate the arrival of the new plot engine with a fireworks display never seen before in RapidMiner:
Oh yes, this truly is the Iris dataset. Can you guess from the legend what you are seeing?
We hope that we have piqued your interest in this new feature. It will be part of the RapidMiner 5.2 beta, which is expected to ship at the end of this week. As usual, you will be notified via RapidMiner's auto update about its availability, or you can just download it from our website.
From time to time, we post articles about how specific analysis methods work and how those methods and approaches can be done with RapidMiner.
Our colleagues from Simafore, a US-based consultancy for advanced analytics, also follow this approach and describe many applications of data mining in real-world scenarios, together with practical examples done with RapidMiner.
So I thought their blog might be interesting to you, especially for those of you not already familiar with the deepest aspects of data modeling. For most of their blog posts, there is also a white paper explaining more details about the method application and how to perform this with RapidMiner.
Here is a small selection of topics:
A Simple Explanation of Decision Tree Modeling based on Entropies
A description of some of the basics of decision trees. Simple, with hardly any math; I like the plots explaining the basic idea of entropy as a splitting criterion (although we actually calculate gain ratio differently than explained...).
White Paper: www.simafore.com/Download-ebook-Decision-Tree-Articles-Digest/
Logistic Regression for Business Analytics using RapidMiner
Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.
White Paper: http://www.simafore.com/download-ebook-Logistic-regression-articles-digest/
Part 1 (Basics): http://www.simafore.com/blog/bid/57801/Logistic-regression-for-business-analytics-using-RapidMiner-Part-1
Deploy Model: http://www.simafore.com/blog/bid/82024/How-to-deploy-a-logistic-regression-model-using-RapidMiner
Advanced Information: http://www.simafore.com/blog/bid/99443/Understand-3-critical-steps-in-developing-logistic-regression-models
Feature Selection and Linear Regression
There are also two articles about feature selection and linear regression:
White Paper: http://www.simafore.com/Download-ebook-Predicting-Sales-using-linear-regression/
And I am sure there is more to come. Please visit Simafore's blog at