Open source software for big data analytics.
No programming required.

Tag >> Hadoop
27 Nov 2012
Radoop Team wins hack/reduce hackathon in Boston by Giuseppe Taibi

hack/reduce brands itself as Boston's Big Data hacking space. Backed by a who's who of Boston tech powerhouses, ranging from Harvard and MIT to Google and Microsoft, to the State of Massachusetts and top-tier VCs, hack/reduce is located in the historic Kendall Boiler and Tank building, which gives its name to the vibrant Kendall Square technology district, brimming with startup excitement.

True to its mission of "helping Boston create the talent and the technologies that will shape our future in a big data-driven economy," hack/reduce organized its first hackathon on Nov. 17. We at Rapid-I love Big Data, so this was a terrific opportunity to mingle with the Boston Big Data community. RapidMiner, Rapid-I's popular open source visual environment for data analysis, can work on Big Data via Radoop, a RapidMiner extension that adds the necessary operators to the standard set, so working on Big Data is as easy as drag-and-drop, with no coding required. In addition to supporting Map/Reduce, Radoop includes a number of machine learning operators based on the Mahout open source library. Mahout is known for being powerful, yet hard to use. Thanks to Radoop, working with Mahout is a breeze.

The day began with a tutorial on Hadoop by Greg Lu, a Software Engineer at hopper who is also the Technical Director of hack/reduce. Then teams were formed. The response to our "Big Data hacking without coding" pitch was terrific, and our team quickly grew from four to over 20 members. We used Skype to keep everybody on the same page and to troubleshoot. That worked great, especially since we had the original developers of Radoop online from Budapest, Hungary. We turned on the video chat, and the remote team really felt as if they were in Cambridge.

The hackathon was great. At some point we had 25 people on Skype. The Radoop team from Hungary supported us during the entire 10 hours of the hackathon. At first, using a visual environment for a hackathon may sound counterintuitive, but in reality our teammates were really happy to work at a higher conceptual level without having to wrestle with capricious code statements. In fact, our team was limited only by the power of the Hadoop cluster that we were working on. Because Radoop is so easy to use, everybody was able to experiment with the data sets and the Hadoop cluster. As a result, the cluster came under stress and slowed down while trying to keep up with the number of job requests. The hackathon also helped the Radoop development team uncover a bug that slowed down processing of a clustering algorithm. (The bug is now fixed.)

Our team worked on a 25GB database of dating profiles provided by Mate1.com. Other available data sets included carbon dioxide measurements, an Amazon.com product database, stock market prices, Wikipedia and more (the full list of data sets is available on the hackathon wiki). We were interested in performing cluster analysis to explore the similarities among user profiles. The Mate1 user profile attributes included age, gender, eye color, smoking habits, dating preferences, astrological signs, physical fitness, political views and many others.

For this task, we applied a K-Means clustering operator to the dataset, then used RapidMiner to create a scatter matrix plot to explore how the profile attributes were related to each other. We found that most members filled out only the minimum number of fields on the profile. Also, for whatever reason, people with the same eye color also identify with the same body type. In almost every comparison we noticed that many people chose not to specify a value for an attribute. People definitely tend to enter the minimum information necessary to create a profile and start browsing other people's profiles. One of the frustrations was that the data set was normalized, so we did not really know the exact meaning of a given attribute value. Towards the end we started to reverse engineer this by creating our own profile on the Mate1.com website, but then we ran out of time.
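For readers who have never looked inside a K-Means operator, here is a minimal single-machine sketch in Python of the textbook algorithm (Lloyd's iteration). It is not the Radoop or Mahout implementation, and the toy profiles and their two numeric attributes (an age and a made-up body-type code) are invented purely for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iterations=20, seed=0):
    """Textbook K-Means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so we have converged
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: (age, body-type code). Not taken from the Mate1 data set.
profiles = [(20, 1), (22, 1), (21, 2), (50, 3), (52, 3), (51, 4)]
centroids, labels = kmeans(profiles, k=2)
```

On real profile data the attributes would need to be put on comparable scales first, otherwise a wide-ranging attribute such as age dominates the distance.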

We also conducted an analysis to verify the "Half Your Age Plus 7" rule, which refers to the age difference between partners that is considered socially acceptable. More specifically, we mined the dating database to answer the question "What is the oldest / youngest person that you are willing to date?". In a very entertaining presentation, one team member exposed the harsh fact that for Gender "2" the rule generally holds true, while for Gender "3" there is a big difference: members in their 20s and 30s are willing to date partners much older than the rule allows. The database provided did not specify a text label for the gender, only a number, so feel free to guess which is which.
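The rule itself is simple arithmetic, so here is a small Python sketch of it. The lower bound follows directly from the rule's name; the upper bound is the commonly cited inverse of the rule (the age at which you become the partner's own lower bound), which is an assumption on our side and not something the data set defines. The function names are made up for this example.

```python
def acceptable_age_range(age):
    """Return (youngest, oldest) partner age under the half-your-age-plus-7 rule."""
    youngest = age / 2 + 7
    # Assumed inverse of the rule: the oldest partner whose own
    # lower bound (partner / 2 + 7) does not exceed `age`.
    oldest = (age - 7) * 2
    return youngest, oldest

def within_rule(age, preferred_partner_age):
    """True if the preferred partner age falls inside the rule's range."""
    youngest, oldest = acceptable_age_range(age)
    return youngest <= preferred_partner_age <= oldest

# A 30-year-old gets the range 22 to 46 under these assumptions.
youngest, oldest = acceptable_age_range(30)
```

Comparing each member's stated youngest and oldest preferences against this range is one way to check how well the rule holds for a given group.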

The main sponsor of the hackathon was hopper, a startup focused on redefining travel using Big Data, which is among the founders of hack/reduce.

Other teams also presented interesting work, ranging from a cool iPad app made by Praveen Aravamudham, with a spinning globe mapping CO2 emissions around the world, to an analysis of the most used words in Wikipedia ("United States" came out on top).

Right after the teams’ final presentations, all hackathon participants were given the opportunity to vote for the team that they thought produced the most interesting work. The Radoop team was off to a great start in the polls and led the race all the way until Andree Coude, VP Technology at hopper, declared the voting over and the Radoop team the winner.

Now we are figuring out how to make the best use of the award of $1,000/month of computing power at SoftLayer. Stay tuned.

The video of the final presentation is available at: http://www.ustream.tv/recorded/27101415

Boston Team Members:

Sheamus McGovern - CTO, Capital Market Exchange and machine learning blogger

Todd Cioffi - Director, Technical Training, Navis Learning

Joe Rothermich - Data Scientist and Co-Founder, PeopleHedge

Dan Gerlanc - Predictive Analytics and Visualization Consultant and Founder, Enplus Advisors

Daniel Colonnese - WebSphere Managing Consultant, Lighthouse Computer Services

Sridhar Alla - CTO, eIQnetworks

Kleber Gallardo - CEO, Alivia Technology

Giuseppe Taibi - CEO, Rapid-I North America

Budapest Team Members:

Zoltán Prekopcsák - CEO and Co-Founder, Radoop

Péter Hellinger - Senior Software Engineer and Co-Founder, Radoop

Gabor Makrai - Chief developer and Co-Founder, Radoop

 

Photo Gallery

hack/reduce Radoop Team

Team Radoop hacking away

hack/reduce hackathon

RapidMiner process

Radoop Process Using Mahout K-Means Clustering Operator

hack/reduce Hackathon Voting Results

Rapid-I Team

Radoop Team - hack/reduce Hackathon

8 Jul 2011
Big data analytics made easy: Radoop by Ingo Mierswa

Those of you who visited RCOMM 2011 already know about Radoop, the powerful combination of RapidMiner with Hadoop. This makes big data analytics easier than ever. I missed the talk myself (shame on me!), but we had a lot of fruitful discussions afterwards, and from my point of view this will become the next RapidMiner revolution. Below you will find some information about the project.

What is Hadoop?

Hadoop is a software framework that supports data-intensive distributed applications. It is based on Google's now well-known map & reduce paradigm, which makes it an excellent tool for analyzing large data sets. In principle, Hadoop is able to work with thousands of computing nodes on petabytes of data.
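To give a feeling for the map & reduce paradigm, here is a tiny word count written in plain Python. It runs in a single process and only mimics the three stages (map, shuffle, reduce); it does not use Hadoop's actual API, and all function names are made up for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data", "Big cluster"])
```

In a real Hadoop job the map and reduce calls run on different nodes and the framework handles the shuffle, which is what lets the same pattern scale to very large data sets.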

 

 

What about Hive and Mahout?

Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses the distributed file system of Hadoop and the efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouse.

Mahout is a machine learning library that already offers many scalable machine learning algorithms, also implemented on top of Hadoop and its map & reduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks making use of the power of Hadoop.

You will see below that both frameworks will be tightly integrated with RapidMiner.

What can RapidMiner bring into the game?

Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst resorts to some nasty tricks, the data size is limited by the available memory. So we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.

RapidMiner + Hadoop = Radoop

Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.

Zoltán Prekopcsák presented Radoop at RCOMM 2011.

 

 

Right now, a restricted beta phase has started and you can apply for it at http://radoop.eu/. More information about Radoop can be found at http://blog.radoop.eu/.
