hack/reduce brands itself as Boston's Big Data hacking space. Backed by a who's who of Boston tech powerhouses, ranging from Harvard and MIT to Google and Microsoft, to the State of Massachusetts and top-tier VCs, hack/reduce is located in the historic Kendall Boiler and Tank building, which gives its name to the vibrant Kendall Square technology district, brimming with startup excitement.
True to its mission of "helping Boston create the talent and the technologies that will shape our future in a big data-driven economy," hack/reduce organized its first hackathon on Nov. 17. We at Rapid-I love Big Data, so this was a terrific opportunity to mingle with the Boston Big Data community. RapidMiner, Rapid-I's popular open source visual environment for data analysis, can easily work on Big Data via Radoop, a RapidMiner extension that adds all the necessary operators to the standard set. With Radoop, working on Big Data is as easy as drag-and-drop, with no coding required. In addition to supporting Map/Reduce, Radoop includes a number of Machine Learning operators based on the open source Mahout library. Mahout is known for being powerful yet hard to use. Thanks to Radoop, working with Mahout is a breeze.
The day began with a tutorial on Hadoop by Greg Lu, a Software Engineer at hopper who is also the Technical Director of hack/reduce. Then teams were formed. The response to our "Big Data hacking without coding" pitch was terrific, and our team quickly grew from four to over 20 members. We used Skype to keep everybody on the same page and to troubleshoot. That worked great, especially since we had the original developers of Radoop online from Budapest, Hungary. We turned on video chat, and the remote team really felt as if they were in Cambridge.
The hackathon was great. At some point we had 25 people on Skype. The Radoop team from Hungary supported us during the entire 10 hours of the hackathon. At first, using a visual environment for a hackathon may sound counterintuitive, but in reality our teammates were really happy to work at a higher conceptual level without having to wrestle with capricious code statements. In fact, our Radoop team was limited only by the power of the Hadoop cluster that we were working on. Because Radoop is so easy to use, everybody was able to experiment with the data sets and the Hadoop cluster. As a result, the cluster came under stress and slowed down while trying to keep up with the number of job requests. The hackathon also helped the Radoop development team uncover a bug that slowed down processing of a clustering algorithm. (The bug is now fixed.)
Our team worked on a 25GB database of dating profiles provided by Mate1.com. Other available data sets included carbon dioxide measurements, an Amazon.com product database, stock market prices, Wikipedia and more (the full list of data sets is available on the hackathon wiki). We were interested in performing cluster analysis to explore the similarities among user profiles. The Mate1 user profile attributes included age, gender, eye color, smoking habits, dating preferences, astrological sign, physical fitness, political views and many others.
For this task, we applied a K-Means clustering operator to the data set, then used RapidMiner to create a scatter matrix plot to explore how the profile attributes were related to each other. We found that most members filled out only the minimum number of fields on their profile. Also, for whatever reason, people with the same eye color identified with the same body type. In almost every comparison we noticed that many people chose not to specify a value for an attribute. People definitely tend to enter the minimum information necessary to create a profile and start browsing other people's profiles. One of the frustrations was that the data set was normalized, so we did not know exactly what a given attribute value meant. Towards the end we started to reverse engineer this by creating our own profile on the Mate1.com website, but then we ran out of time.
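For readers curious about what a K-Means operator does under the hood, here is a minimal sketch in plain Python. It is not the Radoop or Mahout implementation used at the hackathon, and it would not scale to a 25GB data set. The profile tuples (age, eye color code, body type code) and the function names are made up for illustration and are not taken from the Mate1 data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=50, seed=0):
    """Basic Lloyd's algorithm: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random from the data.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical, already-encoded profiles: (age, eye_color_code, body_type_code).
profiles = [
    (21, 1, 1), (23, 1, 1), (22, 1, 2),
    (55, 3, 4), (58, 3, 4), (60, 2, 4),
]

centroids, assignments = kmeans(profiles, k=2)
for profile, cluster in zip(profiles, assignments):
    print(profile, "-> cluster", cluster)
```

In this toy example age dominates the distance because it has the largest numeric range. With real profile data, attributes would normally be rescaled first, and categorical codes such as eye color would need an encoding for which distances are meaningful.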
We also conducted an analysis to verify the "Half Your Age Plus 7 Rule," which describes the age difference between partners that is considered socially acceptable. More specifically, we mined the dating database to answer the question "What is the oldest / youngest person that you are willing to date?" In a very entertaining presentation, one team member exposed the harsh fact that for Gender "2," the rule generally holds true, while for Gender "3," there is a big difference: members in their 20s and 30s are willing to date partners much older than the rule allows. The database provided did not specify a text label for the gender, only a number, so feel free to guess which is which.
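As a quick illustration of the rule itself, here is a small Python sketch. It uses the common formulation in which the youngest acceptable partner is half your age plus seven and the oldest is twice the difference between your age and seven. The function names and the sample preferences are hypothetical and do not reflect the fields in the Mate1 data.

```python
def acceptable_age_range(age):
    """Return (youngest, oldest) partner age allowed by the rule."""
    youngest = age / 2 + 7   # half your age plus seven
    oldest = (age - 7) * 2   # the same rule seen from the older partner's side
    return youngest, oldest


def within_rule(age, youngest_wanted, oldest_wanted):
    """True if a member's stated preference range stays inside the rule."""
    youngest, oldest = acceptable_age_range(age)
    return youngest_wanted >= youngest and oldest_wanted <= oldest


print(acceptable_age_range(30))   # (22.0, 46)
print(within_rule(30, 25, 40))    # True: 25 to 40 fits inside 22 to 46
print(within_rule(25, 24, 50))    # False: 50 is above the upper bound of 36
```

A check like this could be applied to each member's stated youngest and oldest preferences to count how often the rule holds for each gender code.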
The main sponsor of the hackathon was hopper, a startup focused on redefining travel using Big Data, which is among the founders of hack/reduce.
Other teams also presented interesting work, ranging from a cool iPad app made by Praveen Aravamudham, with a spinning earth globe mapping CO2 emissions around the world, to an analysis of the most used words in Wikipedia ("United States" is the most used term).
Right after the teams' final presentations, all hackathon participants were given the opportunity to vote for the team that they thought produced the most interesting work. The Radoop team was off to a great start in the polls and led the race all the way until Andree Coude, VP Technology at hopper, declared the voting process over and the Radoop team the winner.
Now we are figuring out how to make the best use of the award of $1,000/month of computing power at SoftLayer. Stay tuned.
The video of the final presentation is available at: http://www.ustream.tv/recorded/27101415
Boston Team Members:
Budapest Team Members:
Team Radoop hacking away
Radoop Process Using Mahout K-Means Clustering Operator
hack/reduce Hackathon Voting Results
Radoop Team - hack/reduce Hackathon