Poll
Question: Which feature or enhancement would you prefer most for the next major release?
Displaying and editing data flows - 18 (26.5%)
Concentrating on large-scale analysis - 31 (45.6%)
Workspace management including process sets and data repositories - 7 (10.3%)
Allowing own operator groups, group structures and operator tags - 7 (10.3%)
Other (please comment) - 5 (7.4%)
Total Voters: 68

Topic: Features for RapidMiner 5.0
Ingo Mierswa
Administrator
« on: May 24, 2008, 01:23:34 AM »

Hello all,

We would really like to hear which new features or enhancements you would appreciate most in the next major release, RapidMiner 5.0. Of course, we would also be grateful for any other hints on how we could further improve RapidMiner. Just vote in the poll above and/or leave a short comment.

Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
sorenmacbeth
« Reply #1 on: July 30, 2008, 05:06:40 AM »

I believe this was already mentioned in a previous thread, but I would like to reiterate the request here:

I would like to see cointegration testing added to RapidMiner. The most common test (in my experience) is the augmented Dickey-Fuller (ADF) test.
You can find MATLAB code for various cointegration tests here:

http://www.spatial-econometrics.com/coint/contents.html
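
For reference, the ADF test is usually written as the following regression. This is the common textbook form, not something taken from the linked MATLAB code, and the lag order p is a choice left to the user:

```latex
\Delta y_t = \alpha + \beta t + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
```

The null hypothesis is \gamma = 0 (a unit root), tested against \gamma < 0. In the Engle-Granger approach to cointegration, one series is first regressed on the other and the same kind of test is then applied to the residuals, with critical values adjusted for the residual-based setting.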

Cheers!
jean-charles
Guest
« Reply #2 on: August 13, 2008, 11:59:05 AM »

Hi Ingo,

I have just written on "chit chat" about kernel ICA.

Now, an interesting feature: a new operator for coupling with MATLAB/Scilab, as in LabVIEW. For instance, two parameters for this operator:
- "choose/browse scilab script file"
- "script file : edit"
There would be a few more parameters to map RM types (either "exampleset" or "matrices") to MATLAB matrix/vector types, or the other way round. Assuming MATLAB or Scilab is installed, it would then be called as a server when the experiment runs.
From a prototyping perspective, it does not matter if a function is redundant with RM operators...
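
The type mapping described above could be sketched roughly as follows. This is only an illustration in Python, not RapidMiner or MATLAB code: the list-of-dicts stand-in for an ExampleSet and the helper names `example_set_to_matrix` and `matrix_to_example_set` are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for an ExampleSet: one dict per example (row).
ExampleSet = List[Dict[str, float]]
Matrix = List[List[float]]


def example_set_to_matrix(examples: ExampleSet, attributes: Sequence[str]) -> Matrix:
    """Map the chosen attributes to matrix columns, in the given order."""
    return [[row[name] for name in attributes] for row in examples]


def matrix_to_example_set(matrix: Matrix, attributes: Sequence[str]) -> ExampleSet:
    """Map matrix columns back to named attributes."""
    return [dict(zip(attributes, row)) for row in matrix]


examples = [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 4.0}]
matrix = example_set_to_matrix(examples, ["a", "b"])

# The external script would work on `matrix` here; this sketch just doubles it.
doubled = [[2 * value for value in row] for row in matrix]
result = matrix_to_example_set(doubled, ["a", "b"])
print(result)
```

The attribute list is what the operator parameters would have to capture: it fixes which attribute ends up in which column, in both directions.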

Another feature that would save me from working on an external database: more relational algebra operators (projection, reduction, etc.). Maybe a specific "Pentaho/Kettle" operator could access it and handle such specific tasks, collecting the results at the end?

With such handling operators, one could definitely focus on the diversity of algorithms.

Cheers,
   Jean-Charles.
steffen
« Reply #3 on: August 13, 2008, 12:46:30 PM »

Hello Jean-Charles

Quote
Now, an interesting feature: a new operator for coupling with MATLAB/Scilab, as in LabVIEW. For instance, two parameters for this operator:
- "choose/browse scilab script file"
- "script file : edit"
There would be a few more parameters to map RM types (either "exampleset" or "matrices") to MATLAB matrix/vector types, or the other way round. Assuming MATLAB or Scilab is installed, it would then be called as a server when the experiment runs.
From a prototyping perspective, it does not matter if a function is redundant with RM operators...

I thought about this too, but:
I am not sure what the use case would be in general. Matlab (which I know, unlike Scilab) provides a complete environment for easy data handling, easier (if you are familiar with the programming language) than in Java/RapidMiner. The disadvantage is that it invites wild hacking, which decreases the reproducibility and comprehensibility of your experiments. One of RM's strengths is the GUI, especially the operator tree, which improves both of these properties.

In my opinion, using learning algorithms in Matlab from RM or vice versa makes sense, but allowing arbitrary scripts will mess up the process. Two additional notes regarding this:
1. R is also an alternative, because...
  • it is free
  • it has a LARGE community of active developers
  • an interface to Java already exists
Since I am using R in my diploma thesis, I am thinking about writing some kind of extension for RM using the mentioned interface... but I am not sure yet whether I am going to need it.

2. Most of the connections to external environments use JNI. I do not have much experience with it, but I think it will be rather slow... (yes, it is written in C, but shifting data from one environment to another means multiple transformations of the representation of the data... which cannot be good)

In this thread (http://rapid-i.com/rapidforum/index.php/topic,99.0.html) it was suggested to use scripting languages for small data transformations, as in Kettle. The main advantage of e.g. JavaScript or Groovy is that it can be run on the JVM, too. I think this is the kind of scripting extension that will keep the system "clean".

Quote
Another feature that would save me from working on an external database: more relational algebra operators (projection, reduction, etc.). Maybe a specific "Pentaho/Kettle" operator could access it and handle such specific tasks, collecting the results at the end?

Pentaho Kettle processes the data row by row, RM en bloc. I am not sure if this is going to be a problem (at least not if a Kettle process is used before or at the end), but I still see the danger of mixing up two different concepts...
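
To make the difference between the two processing styles concrete, here is a small Python sketch. It does not use Kettle or RapidMiner; the function names and the list-of-dicts data are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, float]


def scale_row_by_row(rows: Iterable[Row], factor: float) -> Iterator[Row]:
    """Streaming style: each row is transformed and passed on immediately."""
    for row in rows:
        yield {name: value * factor for name, value in row.items()}


def normalize_en_bloc(rows: List[Row], attribute: str) -> List[Row]:
    """Block style: needs the whole data set first (here, to find the maximum)."""
    maximum = max(row[attribute] for row in rows)
    return [{**row, attribute: row[attribute] / maximum} for row in rows]


data = [{"x": 1.0}, {"x": 2.0}, {"x": 4.0}]
streamed = list(scale_row_by_row(iter(data), 10.0))
blocked = normalize_en_bloc(data, "x")
print(streamed)
print(blocked)
```

The streaming step never needs more than one row in memory, while the block step cannot emit anything before it has seen all rows, which is where combining the two concepts in one process gets awkward.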

Quote
With such handling operators, one could definitely focus on the diversity of algorithms.
I totally agree. The most important lesson I learned in the past year is that no matter how powerful your (non-script-based) data mining environment is, there will always be a task where you have to go beyond the standard borders and extend it...
Regarding the discussion of using an external scripting language, I would break it down to the formula: "Reproducibility vs. Flexibility"
Joel Spolsky calls it The Law of Leaky Abstractions (http://www.joelonsoftware.com/articles/LeakyAbstractions.html)

Long text, thank you for reading...

the advocatus diaboli has struck back  ;)

Steffen
« Last Edit: August 13, 2008, 12:48:01 PM by steffen »

"I want to make computers do what I mean instead of what I say"
Read The Fantastic Manual
jean-charles
Guest
« Reply #4 on: August 13, 2008, 03:29:47 PM »

Hi Steffen, Hi Ingo,

OK for "R", it seems more sensible than Matlab. Ah, just one point: Matlab is proprietary, while Scilab has a GPL-like license.

About the "leaky abstraction" of a scripting language: I would like to emphasize the role of "mapping the types"; it is a bit like a "visual declaration". It may require two different methods, one for casting from RM to the "external language engine (ELE)", another from ELE to RM.

Examples from Labview :
http://forums.ni.com/ni/attachments/ni/170/334969/1/pb%20LabVIEW.JPG
http://forums.ni.com/ni/attachments/ni/170/335497/1/untitled6.JPG
http://zone.ni.com/devzone/cda/tut/p/id/4854

On the border of the "matlab script node" box, you specify (in an orange-reddish color) which feature corresponds to what... Do you see what I mean?

Cheers,
  Jean-Charles.
steffen
« Reply #5 on: August 13, 2008, 09:35:06 PM »

Hello

Hm, I guess you misunderstood what I meant by "leaky abstractions". I was not referring to the problem of converting data from one environment to another, but to the problem of abstraction in the data mining process. RapidMiner is an abstraction since it provides strictly defined operators, which can be combined into more powerful ones, but all of them have a clear definition of what they do. This allows reproducibility.

Scripting languages are basic; no abstractions are provided. You can do anything with them, which gives you nearly infinite freedom, but since no "atomic units" are defined, reproducibility decreases. Of course you can define such units, but this always leads to "reinventing the wheel" and (as I mentioned above) you will almost certainly reach a point where you have to break the abstractions for a special task.

Reproducibility: Months after running a certain experiment you look at the setup again: you are able to quickly understand what you did and reproduce the results by rerunning the experiments. One may argue that you can achieve this with scripts, but keep in mind how easily scripts can be changed in comparison to RM operators... and people tend to forget where they have used a script before and so adjust it mindlessly for their current task... months later:  ???

I hope it is clear now... combining RM and a too powerful scripting language (= a language powerful enough for data mining on its own) will (in my opinion) mess up the process and weaken RM's main strength: its clarity. Hence a "simple" scripting language like JavaScript for data transformations is enough.

You see, I have really struggled with this...

I am eager to hear more opinions...

Steffen
 
« Last Edit: August 13, 2008, 09:40:17 PM by steffen »
jdouet
« Reply #6 on: August 13, 2008, 11:11:07 PM »

Hi Steffen,

The answer to your opinion is in the poll chart above: assuming that reproducibility  :-/  is linked to the 'reuse the code and personal tags' bin, I am not sure that it is what users are looking for. They want parallelization, memory efficiency and memory map control. Is that the case with either JavaScript or "R"?

By definition, a scripting language is very flexible, but is it fast enough ? The only object-oriented scripting language that I know with such requirements is "lush" : http://lush.sourceforge.net/
Just have a look at their "process log" for Octave / interpreted lush in the faq : http://lush.sourceforge.net/faq.html

After that, if it were just for handling data, I am not sure...  ::)

Any comments ?

Cheers,
  Jean-Charles.
steffen
« Reply #7 on: August 14, 2008, 07:33:00 AM »

Good morning

Quote
They want parallelization, memory efficiency and memory map control. Is that the case with either JavaScript or "R"?

R offers these options, but I have not tried parallelization with R yet. The question is: "Does anyone who is not capable of using a scripting language (for more than simple transformations) want to have one in RM, or would he/she prefer more operators?" or, put differently, "How do people who are capable of using a powerful scripting language use RM?" My personal, subjective answer to the last question is: I use it for coding operators which represent the results of prototyping.

Quote
By definition, a scripting language is very flexible, but is it fast enough ? The only object-oriented scripting language that I know with such requirements is "lush" : http://lush.sourceforge.net/
Just have a look at their "process log" for Octave / interpreted lush in the faq : http://lush.sourceforge.net/faq.html

@type of language
R is a functional-object-oriented, interpreted language. This sounds crazy, but once you get the concept, it is easy to use
@speed
R is capable of using compiled C code, so it can be very fast, too. I must admit that I do not know how fast it is on its own, but I will search the net for some benchmarks...
@lush:
  • The latest lush release was in 2006, the latest news message in May 2007. Additionally, I was not able to find any linked libraries containing standard data mining algorithms. Any links?
  • The speed test is fun. Saying in one breath that "for" is slow in Octave/Matlab and then using it for a speed test? They had better try this in Octave:
Code:
sum([1:1000000])
  • Most important point: it is written in a LISP-like syntax. LISP, I mean: LISP. I used this language in one course at university and still have nightmares  :-/. I still don't get why people are so amazed by this counterintuitive bracket language.
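
The point about the speed test (an explicit loop versus a single built-in call) can be tried out with a small sketch like the one below. It is written in Python rather than Octave, so the timings say nothing about Octave, Matlab or lush; it only illustrates the two ways of writing the same sum.

```python
import time


def sum_with_loop(n: int) -> int:
    """Add 1..n one element at a time, as an interpreted 'for' loop would."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_builtin(n: int) -> int:
    """Let the runtime do the iteration, analogous to sum([1:n]) in Octave."""
    return sum(range(1, n + 1))


n = 1_000_000

start = time.perf_counter()
loop_result = sum_with_loop(n)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
builtin_result = sum_builtin(n)
builtin_seconds = time.perf_counter() - start

print(loop_result == builtin_result)
print(f"loop: {loop_seconds:.4f}s, built-in: {builtin_seconds:.4f}s")
```

Both functions return the same value; only the measured times differ, and those depend on the machine and interpreter.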

    greetings

    Steffen

    PS: Maybe it would be interesting to start the same discussion on "Analytic Bridge". But I am afraid the members of that site are too "high-level" for this. Conversely, I do not know a single thing about SAS, SPSS and the like.

    Ingo Mierswa
    « Reply #8 on: August 15, 2008, 12:23:18 PM »

    Hello,

    I understand both sides in this discussion and must say that I would usually prefer that people write operators following all strictly defined guides and design boundaries instead of writing things to disk, applying their used-for-10-years-and-proven-to-be-the-best Perl, R, Matlab or whatever script, and reloading the data back into RM. But we have actually noticed many times that people are doing exactly this. Partly because those analysts already have a solution to a problem in another language which is not available in RM (or not yet found by the user  ;)). Partly because it is sometimes easier to make a fast hack in some language before bothering with Java, an IDE, and things like operators.xml. And finally, some people simply do not speak Java and do not want to learn it.

    So I feel that we have to find a solution for this anyway, and we could stick (as you two have begun) to the question of how much power we want to allow and which languages / systems we want to support.

    - the first option is to add a new operator which is able to define the required inputs and guaranteed outputs and the corresponding variable names, and which allows arbitrary Java code working on these variables. This is extremely powerful and would directly relate to developing a new operator but without the need of working in an IDE etc. Advantages: quick and powerful. Disadvantage: no reusability at all beside copy and paste.

    - add (one or several) scripting operators which work on data sets and also deliver a data set, and support a scripting language for this. From a conceptual point of view I agree with Steffen that this will probably not "disturb" the concept of RM. On the other hand, we could add additional other operators as well: no one would of course have to use those...

    - we could think of integrating R like integrating Weka: using the learning schemes available in R. However, there are several drawbacks: there is, as far as I know, no clean way to access all available learning methods. There is even no clean definition of the input format or the format we could expect for the output. This would mean that we would have to transform the in- and output for each new method anew which is, well, more like pain than programming...

    - another drawback is that the algorithms in R are often not fast and / or memory-efficient enough to work on larger data sets. People are working on this, but in its current state I am not sure if using R would be feasible on databases with millions of tuples
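
The first option (an operator that declares its required inputs and guaranteed outputs and then runs arbitrary user code) might look roughly like this. The sketch is in Python rather than Java, and `ScriptOperator` and everything inside it is hypothetical; it is not RapidMiner's operator API.

```python
from typing import Any, Callable, Dict, Sequence


class ScriptOperator:
    """Hypothetical operator: declares inputs/outputs and runs user code."""

    def __init__(self, inputs: Sequence[str], outputs: Sequence[str],
                 body: Callable[..., Dict[str, Any]]) -> None:
        self.inputs = list(inputs)
        self.outputs = list(outputs)
        self.body = body

    def apply(self, **provided: Any) -> Dict[str, Any]:
        # Check the declared inputs before running the user code.
        missing = [name for name in self.inputs if name not in provided]
        if missing:
            raise ValueError(f"missing required inputs: {missing}")
        results = self.body(**{name: provided[name] for name in self.inputs})
        # Check that the user code delivered every guaranteed output.
        not_delivered = [name for name in self.outputs if name not in results]
        if not_delivered:
            raise ValueError(f"script did not deliver outputs: {not_delivered}")
        return {name: results[name] for name in self.outputs}


def user_code(example_set):
    """Stands in for the arbitrary code a user would type into the operator."""
    return {"row_count": len(example_set)}


operator = ScriptOperator(["example_set"], ["row_count"], user_code)
output = operator.apply(example_set=[{"a": 1}, {"a": 2}])
print(output)
```

The two checks are the whole "contract": the surrounding process only ever sees the declared names, whatever the user code does in between.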

    Just a few thoughts. Cheers,
    Ingo
    steffen
    « Reply #9 on: August 15, 2008, 11:24:06 PM »

    Hello

    Quote
    - the first option is to add a new operator...
    I was playing around a little with Groovy this way. The absence of an IDE was not comfortable, since import statements have to be included anyway... I guess if someone is willing to prototype in Java or a Java-like language, the difference in advantages between a scripting/include-code operator and developing in an IDE is minimal.

    Quote
    - add (one or several) scripting operators which work on data sets and also deliver a data set, and support a scripting language for this. From a conceptual point of view I agree with Steffen that this will probably not "disturb" the concept of RM. On the other hand, we could add additional other operators as well: no one would of course have to use those...

    Hmm, yes. But how many operators do you want to create this way ? I think a good point to start the "scripting experiment" could be a scripting-supported feature generation, which allows theoretically everything by import and has already predefined access methods for the input ExampleSet(s).

    Quote
    - we could think of integrating R like integrating Weka: using the learning schemes available in R. However, there are several drawbacks: there is, as far as I know, no clean way to access all available learning methods. There is even no clean definition of the input format or the format we could expect for the output. This would mean that we would have to transform the in- and output for each new method anew which is, well, more like pain than programming...


    This is sad, but true. As I mentioned above, I will probably use both "environments" together, primarily for the learning algorithms in RM. For this task, a simple conversion from "ExampleSet" to "data.frame" and vice versa will be enough. This should not be too complicated. If I create those conversions, I will contribute them.
    Besides this: as you already said, there are a lot of fancy scripting languages out there. So making everyone happy is nearly impossible, not to mention the extra work if not even a simple call interface exists.
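
The core of such a conversion is a change from row-oriented examples to column-oriented data, since a data.frame is organized by column. A minimal Python sketch of that step, assuming every example carries the same attributes; the dict-based structures are stand-ins, not the real ExampleSet or data.frame types:

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def rows_to_columns(examples: List[Row]) -> Columns:
    """Row-oriented examples -> one list per attribute (data.frame-like)."""
    columns: Columns = {}
    for row in examples:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_rows(columns: Columns) -> List[Row]:
    """Column-oriented data -> one dict per example."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


examples = [{"age": 31, "label": "yes"}, {"age": 45, "label": "no"}]
frame = rows_to_columns(examples)
round_trip = columns_to_rows(frame)
print(frame)
```

A real bridge would additionally have to carry over value types and special roles such as the label, which this sketch ignores.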

    Quote
    - another drawback is that the algorithms in R are often not fast and / or memory-efficient enough to work on larger data sets. People are working on this, but in its current state I am not sure if using R would be feasible on databases with millions of tuples

    Hm, okay  :-/. Fortunately, this does not play a role for me (for the next months  ;D), but (seriously) from this point of view R cannot be the primarily supported language. On the other hand: which languages are available for databases of this size? Fortran and C? Parallelization can also do the trick, of course, but: in my view, prototyping does not mean killing computers by running hacked scripts on millions of data sets, but testing whether a strategy works at all. Then, in the calculation step, I would prefer a strict and clearly defined language without doubts. But this is (quite probably) the view of an inexperienced user.

    Two final remarks:
    • The R efficiency: I believe you (!), but do you have any links or references stating this from a more official side (or maybe a paper?)
    • Conclusion (at this stage?): Extending RM by an external powerful scripting language is impossible insofar as there are no clearly defined access points and too many types of ResultObjects would have to be converted. Maybe one can use the wiki to collect helpful conversion scripts that users have contributed.

    btw: Do you know Sage? I have not tried it yet, but it seems that this "environment" collects different open source and commercial "environments" and brings them together under one hood. But since it is based on Python, I guess it is somehow "hacking", too.  ;)

    good night

    Steffen
    jcd
    Guest
    « Reply #10 on: August 20, 2008, 03:25:34 PM »

    Hi All,

    Inside the discussion, just two "alien" requirements :
    - Association Rules: There is a "minimum confidence" widget, but it would be useful to be able to choose the measure. I would rather like "minimum lift" or "minimum something"; would it be possible to modify the ASR tab to choose which "something" to observe and tune? Ah, and there seems to be a tiny visual bug. In "graph" mode for ASR, I cannot resize the two frames, so the ASR drawing is completely compressed on the right because of the long ID names on the left.
    - "DDL-type" SQL: It could be interesting to have a DatabaseAttributeConstructionWriter, so that "att" files are parsed into an SQL dump file defining which tables and columns to write inside a DB. Maybe a DatabaseAttributeConstructionReader would be useful too?
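
The second request boils down to generating table definitions from attribute descriptions. A rough Python sketch of that idea follows; it does not parse real "att" files, and the (name, value type) pairs as well as the type mapping in `SQL_TYPES` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed mapping from attribute value types to SQL column types.
SQL_TYPES: Dict[str, str] = {
    "integer": "INTEGER",
    "real": "DOUBLE PRECISION",
    "nominal": "VARCHAR(255)",
}


def create_table_statement(table: str, attributes: List[Tuple[str, str]]) -> str:
    """Build a CREATE TABLE statement from (name, value type) pairs.

    Identifiers are inserted as-is; a real writer would have to quote them.
    """
    columns = ",\n".join(
        f"  {name} {SQL_TYPES[value_type]}" for name, value_type in attributes
    )
    return f"CREATE TABLE {table} (\n{columns}\n);"


statement = create_table_statement(
    "iris",
    [("id", "integer"), ("sepal_length", "real"), ("label", "nominal")],
)
print(statement)
```

The reader direction would do the reverse: query the database metadata and emit one attribute description per column.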

    Cheers,
       Jean-Charles.
    Ingo Mierswa
    « Reply #11 on: August 21, 2008, 11:27:54 AM »

    Hi,

    so we will probably improve the existing operators in four ways:

    - allowing an easier but also more powerful feature generation, similar to the existing one but without its drawbacks (prefix and constants are really a pain...) - we are actually already working on this one
    - add a scripting based feature selection operator in addition to the value type and name based (regular expression) filters already existing
    - add a scripting based example selection operator similar to the things mentioned in point 2
    - add a generic scripting operator (I don't think it's impossible if you only define enough restrictions) which works on example sets and will deliver an example set for almost arbitrary preprocessing not covered by the existing operators

    So the main questions are: which scripting language should we support? I would suggest one natively supported by the Java scripting engine delivered since Java 6 (which would on the other hand mean that all users who want to use those new operators would have to rely on Java 6). And which restrictions do we have to define in order to ensure that things will work as expected?
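
Script-based feature selection and example selection both come down to applying a user-supplied condition, once to attribute names and once to whole examples. A minimal sketch, written in Python instead of a JVM scripting language; the lambdas stand in for user scripts, and the function names are hypothetical:

```python
from typing import Callable, Dict, List

Example = Dict[str, float]


def select_attributes(examples: List[Example],
                      keep: Callable[[str], bool]) -> List[Example]:
    """Keep only the attributes whose name satisfies the user condition."""
    return [{name: value for name, value in row.items() if keep(name)}
            for row in examples]


def select_examples(examples: List[Example],
                    keep: Callable[[Example], bool]) -> List[Example]:
    """Keep only the examples that satisfy the user condition."""
    return [row for row in examples if keep(row)]


data = [
    {"att1": 0.5, "att2": 3.0, "label": 1.0},
    {"att1": 1.5, "att2": 1.0, "label": 0.0},
]
only_atts = select_attributes(data, lambda name: name.startswith("att"))
large = select_examples(data, lambda row: row["att1"] > 1.0)
print(only_atts)
print(large)
```

One possible restriction of the kind asked about above: both functions return a new example set and never modify their input, so a script can filter but cannot corrupt the data passed to it.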


    Quote
    The R efficiency: I believe you (!), but do you have any links or references stating this from a more official side (or maybe a paper?)

    Not a paper, sorry. But the annual user conference of R (useR!) happened to take place in Dortmund last week:

    http://www.statistik.uni-dortmund.de/useR-2008/

    In some of the talks and especially in the discussions during the breaks the missing scalability was one of the main topics. Maybe you find some hints in the slides of the talks given on the web site. As I said before: people are working on this...

    Cheers,
    Ingo
    Ingo Mierswa
    « Reply #12 on: August 21, 2008, 02:17:53 PM »

    Hi Jean-Charles,

    Quote
    - Association Rules: There is a "minimum confidence" widget, but it would be useful to be able to choose the measure. I would rather like "minimum lift" or "minimum something"; would it be possible to modify the ASR tab to choose which "something" to observe and tune? Ah, and there seems to be a tiny visual bug. In "graph" mode for ASR, I cannot resize the two frames, so the ASR drawing is completely compressed on the right because of the long ID names on the left.

    Your wish is my command  ;)

    You can select the criterion now for the association rules generator operator, for the table view, and for the graph view. The slider for the minimal value is also better adjusted so you can adjust the minimal value in a "smoother" way.  We also had a look into the "long name" issue: the list on the left provides now also a horizontal scrollbar if necessary. The same applies for the table view for association rules and for the plotter panel, where this also happened from time to time.
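
For readers who want to see what a selectable criterion means in practice, here is a small Python sketch that computes support, confidence and lift for one rule and filters on whichever measure is chosen. It is not the code of the association rules generator operator; the transactions and function names are made up.

```python
from typing import Dict, FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: FrozenSet[str]) -> float:
    """Fraction of transactions that contain all of the given items."""
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rule_measures(transactions: List[Transaction],
                  premise: FrozenSet[str],
                  conclusion: FrozenSet[str]) -> Dict[str, float]:
    """Support, confidence and lift of the rule premise -> conclusion.

    Assumes premise and conclusion each occur at least once.
    """
    rule_support = support(transactions, premise | conclusion)
    confidence = rule_support / support(transactions, premise)
    lift = confidence / support(transactions, conclusion)
    return {"support": rule_support, "confidence": confidence, "lift": lift}


def passes(measures: Dict[str, float], criterion: str, minimum: float) -> bool:
    """Filter on whichever criterion the user selected."""
    return measures[criterion] >= minimum


transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]
measures = rule_measures(transactions, frozenset({"bread"}), frozenset({"butter"}))
print(measures)
print(passes(measures, "lift", 1.2), passes(measures, "confidence", 0.8))
```

The same rule can pass one threshold and fail another, which is why being able to pick the criterion matters.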

    As always, you can access those changes (way before RM 5.0  ;D) from CVS or wait for the next release. Users of the RapidMiner Enterprise Edition will of course get access to these changes with the next automatic update.


    Quote
    - "DDL-type" SQL: It could be interesting to have a DatabaseAttributeConstructionWriter, so that "att" files are parsed into an SQL dump file defining which tables and columns to write inside a DB. Maybe a DatabaseAttributeConstructionReader would be useful too?

    Ok, that's actually a bigger one. We are currently checking if we can provide SQL for (at least some) preprocessing models / prediction models so it is likely that something like this will be possible in some future release.

    Cheers,
    Ingo
    jean-charles
    Guest
    « Reply #13 on: August 26, 2008, 03:01:48 PM »

    Hi Ingo,

    Quote
    Your wish is my command  ;)
      ;D A bit of "wiki stuff" will do the trick.

    Now, playing around with the CRF plugin, I wondered how to tag raw text, since that is not really the purpose of NERPreprocessing. I found the "GATE" software, already known to the Rapid-I team (isn't it?), written in Java, and I thought it could be reused in RapidMiner through a "scripting box" operator. I do not know GATE well enough at the moment, but would it be an interesting extension to the Text Mining operators, in one way or another? I feel that all the "tagging", "parsing" and "dictionary" resources would be a powerful add-on.

    Imagine that in a "TextInput" tree, a few extra operators would do a syntactic analysis and tagging to produce a "rich" ExampleSet...
    What do you think about it ?

    Cheers,
      Jean-Charles.
    jean-charles
    Guest
    « Reply #14 on: August 26, 2008, 03:22:12 PM »

    Huh, where is Steffen? May I invoke "advocatus diaboli"?  :D