Topic: Amazon EC2 and Rapid Miner
Ingo Mierswa
« on: May 23, 2008, 11:38:31 PM »

Original message from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=2039039&forum_id=390413

Has anyone tried using Rapid Miner on Amazon EC2 (http://www.amazon.com/gp/browse.html?node=201590011)? If so, any impressions of it?
 
It seems like this may be of value, as the computing resources could be scaled up and down quickly and easily.
 
Regards,
Eric


Answer by Ingo Mierswa:

Hi Eric,
 
I would also like to hear whether anyone has experience with this service...
 
Cheers,
Ingo

maxdama
« Reply #1 on: September 01, 2008, 08:12:14 PM »

Ingo,

I tested RapidMiner on an Amazon EC2 instance running Ubuntu. Screenshots and results are on my website. It worked after a lengthy setup process, but it was much slower than I had hoped. My ultra-portable laptop, with its 1.06 GHz CPU and 1 GB of memory, ran an identical RapidMiner experiment faster than EC2 did. One variable that wasn't controlled for, though, was Ubuntu vs. Windows. I'm not sure how much that affected the result; maybe you can give me an estimate if you have compared performance on Ubuntu/Linux vs. Windows (XP).

The price of EC2 is very low: installing the instance and doing all the setup and testing over two days cost $0.75. Perhaps I should try running a more powerful instance (I was using the most basic one).

Regards,
Max

maxdama.com
Tobias Malbrecht
« Reply #2 on: September 01, 2008, 09:06:56 PM »

Hi Max,

Thanks for sharing your experience with EC2. We have not tried that service yet, so we really appreciate hearing about its performance etc., since EC2 is a fascinating approach. What I would still like to know is how much time it took you to set up the whole thing (until the first run of RapidMiner). Concerning the choice of instances, I assume that the experiments will not speed up remarkably (assuming you do not use the parallelized approaches from the Enterprise version) since, as far as I understand, the larger instances only offer more virtual cores, not faster ones. But this scalability nonetheless seems to be an important (but maybe the only?) advantage of the system. So, if you try running RapidMiner again on a bigger instance, it would be very kind of you to post some of your experiences again!

Thanks again,
Tobias

Tobias Malbrecht
Rapid-I GmbH
maxdama
« Reply #3 on: September 02, 2008, 02:13:43 AM »

Tobias,

It took me one day and one night to set up. Most of that time was spent looking for tutorials and tools to make it easier. The two tools that helped the most were the Elasticfox GUI and the publicly available Ubuntu image with remote desktop enabled, since I'm inexperienced with a command-line-based OS. The tutorial I referenced was the best I could find; following it step by step worked smoothly. Installing and running RapidMiner from the remote desktop of the instance was of course very simple. You could probably repeat what I did in 4 hours since you wouldn't need to search for a tutorial. Unfortunately (in my opinion), Amazon Web Services, including EC2, seems to be targeted toward experienced web administrators with a DIY attitude.

I just tried the fastest available 32-bit instance at your suggestion and it was substantially faster, about 3x. I’m stuck with 32-bit due to the image I’m using. 3x isn’t remarkable, but at least it’s better than my little laptop.

Regards,
Max

maxdama.com
Tobias Malbrecht
« Reply #4 on: September 02, 2008, 03:54:17 PM »

Hi Max,

Thanks for trying that out again. Well, three times as fast as a 1 GHz CPU does not sound magnificent. Nevertheless, if one can scale up the number of cores and the amount of RAM, it is still interesting, especially for validation tasks, which can be parallelized easily. But I do not really understand why the "normal" extra large instance gets 15 GB of RAM while the high-CPU instance gets only 7 GB. That seems illogical to me. Hm, maybe when I have a lot of time I will try EC2 myself... I generally like the idea of a server to which you can hand off your data mining tasks and which will simply send the results back to you ;)

Regards,
Tobias

Peterbvolk
« Reply #5 on: November 05, 2008, 03:10:57 PM »

I know this topic is a bit old, but here are a few comments. The whole cloud idea (including EC2 etc.) is very interesting, especially since people are paying for CPU on demand, something nobody would have dreamed of a while ago.

On to the RM <-> cloud problem: RM is, in my opinion, not the best test case for the cloud. RM is implemented in Java, and Java works its own way around multi-core. It also does not scale too well if you turn the number of CPUs up too high. Also, RM is mostly implemented for linear processing. Not much parallel work is implemented. There are quite a few DM algorithms that actually consider multiple CPUs. But they are not implemented in RM :) SAP, for example, has some quite interesting work on frequent item sets on GPUs and their parallelization (HP workshop at ICDM this year). But the problem remains that parallelization needs to be reconsidered within RM. DBMSs parallelize their execution plans (tree structures) very efficiently; maybe this would be a direction for RM to take in order to support multi-CPU systems... who knows...

Cheers,
Peter
Tobias Malbrecht
« Reply #6 on: November 06, 2008, 02:29:19 PM »

Hi Peter,

Quote from Peter:
"Also, RM is mostly implemented for linear processing. Not much parallel work is implemented. There are quite a few DM algorithms that actually consider multiple CPUs. But they are not implemented in RM :)"

Which algorithms are you referring to in particular? Actually, we have been putting, and are still putting, a great deal of effort into parallelizing those parts of our RM Enterprise Edition where distribution among several cores makes sense. So far we have concentrated on validation and optimization algorithms, since parallelization is extremely straightforward for these. Additionally, we have started to implement parallel versions of learners and have already added a parallel decision tree learner. The tests have shown a massive increase in performance on multicore machines, which does not really support your thesis that Java and multicore applications do not fit well together! ;) If you have particular suggestions and ideas on how to drive our attempts forward, you are very welcome to share them with us.
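
As a rough illustration of why validation parallelizes so easily (a sketch only, not RapidMiner's actual code or API): each fold of a cross-validation is independent of the others, so the folds can simply be submitted to a thread pool and the scores averaged afterwards. The class name ParallelFolds and the method evaluateFold are hypothetical placeholders, and the per-fold score is a dummy value standing in for real training and testing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFolds {

    // Hypothetical stand-in for "train on the other folds, test on this one".
    // It returns a dummy score; a real implementation would build and
    // evaluate a model here.
    static double evaluateFold(int fold) {
        return 1.0 / (fold + 1);
    }

    // Submits every fold to a fixed-size thread pool and returns the mean
    // of the per-fold scores. Folds share no state, so no locking is needed.
    static double crossValidate(int folds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> results = new ArrayList<>();
            for (int i = 0; i < folds; i++) {
                final int fold = i;
                Callable<Double> task = () -> evaluateFold(fold);
                results.add(pool.submit(task));
            }
            double sum = 0.0;
            for (Future<Double> result : results) {
                sum += result.get(); // blocks until that fold has finished
            }
            return sum / folds;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fold evaluation failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("mean score: " + crossValidate(10, cores));
    }
}
```

With a pool of n threads, at most n folds run at once, so the possible speed-up is bounded by both the number of folds and the number of cores available on the instance.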

Regards,
Tobias
