Author Topic: Difference between WEKA and RapidMiner  (Read 27020 times)
jjp
Guest
« on: December 09, 2008, 12:13:48 PM »

Hi @all,

I don't know if this is the right category for this topic, but...

Can anyone please tell me the main differences between WEKA and RapidMiner, and what makes RapidMiner so special?


Thanks in advance
JJP
Ingo Mierswa
Administrator
« Reply #1 on: December 09, 2008, 03:04:49 PM »

Hi,

hmm, this will hopefully not turn out to be just another RapidMiner vs. Weka discussion. But anyway, here are some links:


* In the following thread, Martin posted his opinion on why he and his company preferred RapidMiner, and he pointed out some differences:

http://rapid-i.com/rapidforum/index.php/topic,362.0.html


* A Google search for Weka and RapidMiner would also have given you the following link, leading to a statement of mine in the KDnuggets newsletter (I would actually rather not be reminded of this discussion  ;) ):

http://www.kdnuggets.com/news/2007/n24/5i.html


* There was also a study done for the Data Mining Cup 2007 showing some differences between RapidMiner and other open-source data mining solutions as well as proprietary ones:

http://www.prudsys.de/Service/Downloads/bin/DMC2007_schieder_tuchemnitz.pdf


Finally, you could also have a look at our KDD 2006 paper explaining some of the conceptual ideas behind RapidMiner to see those differences as well. There are also some old threads in the former forum at SourceForge discussing some of the differences.

But in any case: why don't you simply try RapidMiner and find out for yourself? The learning curve might be steep (hey, data mining is a complicated topic after all...), but it's usually worth the effort.

Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
crappy_viking
« Reply #2 on: November 08, 2009, 12:06:27 PM »

Hi @all,

According to the benchmark comparing Weka, RapidMiner and KNIME, RM is a bit weak in data preparation. One solution is "DataCleaner", here:
http://datacleaner.eobjects.org

Pure Java, with query batch optimization, and so efficient that sometimes, for analysis purposes, you don't even need data mining. Clementine has a "Data Quality Audit" showing feature histograms; with a tool like DataCleaner, it can go back to bed.

c.v.
Ralf Klinkenberg
Administrator
« Reply #3 on: November 08, 2009, 12:55:50 PM »

Hi "crappy_viking",

thanks for providing the link to the Data Cleaner project. However, I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?

RapidMiner actually provides significantly more data preprocessing functions and operators than Weka, KNIME, and SPSS Clementine. Feature histograms are also available in RapidMiner, which likewise provides many data cleaning features. If you are not aware of those, I can recommend the RapidMiner training course on Advanced Data Preprocessing for Data Mining with RapidMiner, as well as a series of webinars on data preprocessing and data cleaning with RapidMiner.

Best regards,
Ralf

steffen
« Reply #4 on: November 08, 2009, 01:34:03 PM »

Hello

I think that ETL tools and data mining tools cannot be compared directly.
Kettle (Pentaho Data Integration) illustrates this in how the data flow is organized: in iterators. A process in Kettle is a good one if every step processes only one row at a time. This way you can load, process, and save the data in small portions instead of loading everything into memory at once (like R, *snicker*). RapidMiner has improved regarding such tasks, but as far as I can see, it is still not possible to, e.g., load data row-wise from a CSV file. I know it is possible to do this by loading data from a database, but then it is not possible to monitor the processed rows, i.e. ...
  • show the current process state of the rows
  • if a row could not be processed without an error, store it in an extra file to check manually what happened

If one step does not satisfy this condition (like sorting), the process gets really slow. Pentaho Corp. has bought Weka to include the data mining framework in their application (http://wiki.pentaho.com/display/DATAMINING/Using+the+Knowledge+Flow+Plugin), but frankly: I do not think that embedding one data flow philosophy into another was a good idea.
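The row-wise flow described above can be sketched in a few lines (a minimal illustration in Python rather than Kettle; the function and names are made up for this example):

```python
import csv
import io

def stream_clean(src, transform):
    """Process a CSV source row by row instead of loading it all at once.

    Rows that fail to transform are collected together with their line
    number, so they can be written to an extra file and checked manually.
    """
    ok, errors = [], []
    for n, row in enumerate(csv.reader(src), start=1):
        try:
            ok.append(transform(row))   # in a real process: write out immediately
        except Exception:
            errors.append((n, row))     # divert the bad row, keep going
    return ok, errors

# Row 3 has a non-numeric id and ends up in the error list.
data = io.StringIO("1,foo\n2,bar\nx,baz\n")
good, bad = stream_clean(data, lambda row: (int(row[0]), row[1]))
```

Each row passes through on its own, so memory use stays constant and a single broken row never aborts the whole run.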

Another point is the separation of data management and data analysis. Departments have to talk to each other, but in general I think these are different areas with different targets and responsibilities.

Conclusion:
I would use ETL tools for cleaning (which does not include steps like discretization, but rather steps like duplicate checking), for managing data, and for shifting data around from one source to another. Let the DW specialists take care of that. But when it comes to solving actual data mining problems, I would ask the DW guys to tell me how to get exactly the data I want and then perform the analysis with RapidMiner.

my (of course subjective) point of view

kind regards,

Steffen

"I want to make computers do what I mean instead of what I say"
Read The Fantastic Manual
Ralf Klinkenberg
Administrator
« Reply #5 on: November 08, 2009, 02:12:00 PM »

Hello Steffen,

I agree that the way the data flow is organized is a major differentiator between most ETL and data mining tools. And Kettle and Weka do have different flow logics. However, to some extent, RapidMiner offers both flow logics:

  • RapidMiner can load the full data set into memory, if the memory size is sufficient and if you like to operate this way, and perform time-efficient in-memory preprocessing and mining: CSVExampleSource, DatabaseExampleSource, etc. in RapidMiner 4.6 and CSVReader, DatabaseReader, etc. in RapidMiner 5.
  • RapidMiner can alternatively read the data in chunks, e.g. database line by database line or file by file, and thereby work on the database or on large document collections or large file collections: CachedDatabaseExampleSource, FileIterator, etc. in RapidMiner 4.6 and corresponding operators and further iterators in RapidMiner 5.
  • In RapidMiner 5, you can also iterate over tables in memory row by row, i.e. iterate over examples.
  • Similar to the CachedDatabaseExampleSource, there is a good chance that RapidMiner will also support line-by-line reading of large CSV files in a future version. But you are right, as of now, this is not directly supported yet.
  • The already existing iteration and branching operators of RapidMiner would then allow line-wise monitoring of data preprocessing and data cleaning, i.e. different subprocesses could handle correct and incorrect lines, respectively: ProcessBranch and iterators in RapidMiner 4.6 and 5.
  • My personal point of view and conclusion: For most data preprocessing and data cleaning tasks we have encountered so far in our data mining, text mining, web mining, audio mining, and time series analysis and forecasting applications at Rapid-I, RapidMiner provides all the data preprocessing, cleaning, and transformation operators necessary (see also our list of references to get an idea of the scope of our projects). Furthermore, we keep extending the preprocessing and ETL capabilities of RapidMiner to meet future challenges, and we partner with ETL and data integration tool providers like Talend, Cubeware, Pervasive, etc. to meet any further demands. So, if DataCleaner really offers additional value, it might be a reasonable addition to the aforementioned list of ETL tools.

Best regards,
Ralf

crappy_viking
« Reply #6 on: November 10, 2009, 12:14:44 AM »

Hi All
Quote from: Ralf Klinkenberg
[...] I am not aware of any data cleaning or data preprocessing functionality offered by Data Cleaner that is not already provided by RapidMiner. Could you name any?

DataCleaner has very, very handy features for string analysis. Typically, pattern analysis in the "profiler" gives aggregates, a kind of OLAP example set where each classifier is in fact the string pattern. If your string is the email "ralf123@hotmail.de", it will be classified in the category "aaaa999@aaaaaaa.aa". This is powerful for two reasons:
- It prepares preprocessing, verifying a few consistency points in your data
- It gives you the main patterns to use in predicates or in regexps when you do linguistic analysis, NER, indexing, etc.
For each string, "String analysis" can give the number of blank spaces (useful for trimming), lower/uppercase counts, and the number of words in the string (string vs. nominal).
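This kind of pattern classification is simple to imitate (a rough sketch in Python of the idea, not DataCleaner's actual implementation):

```python
from collections import Counter

def pattern_mask(s):
    """Collapse a string to its pattern: letters -> 'a', digits -> '9',
    everything else kept as-is."""
    return "".join("a" if c.isalpha() else "9" if c.isdigit() else c
                   for c in s)

masked = pattern_mask("ralf123@hotmail.de")   # the email example from above

# Profiling a column then amounts to counting pattern frequencies:
emails = ["ann12@mail.de", "bob34@mail.de", "broken@@x"]
profile = Counter(pattern_mask(e) for e in emails)
```

The two well-formed addresses collapse to the same pattern, while the malformed one stands out as its own category, which is exactly what makes such a profile useful for spotting inconsistencies.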

In another profiler (FEBRL), you can use distances between words for phonetic indexing, which is widely used in data quality:
- soundex, phonex, phonix, metaphone, NYSIIS, etc...
- block/canopy indexing

Other distances, such as Jaro-Winkler or Levenshtein, are available here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
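Levenshtein distance in particular is simple enough to sketch directly (a standard dynamic-programming version in Python, not taken from any of the libraries mentioned):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions turning string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

d = levenshtein("meier", "mayer")   # -> 2 (two substitutions)
```

In record linkage this distance (or a normalized variant) is typically compared against a threshold to decide whether two field values likely refer to the same entity.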

All of this is string data quality functionality and is not in RapidMiner, except for a few algorithms in TextInput (TF-IDF, cosine distance).

c.v.