Poll
Question: Which feature or enhancement would you prefer most for the next major release?
Displaying and editing data flows - 18 (26.5%)
Concentrating on large-scale analysis - 31 (45.6%)
Workspace management including process sets and data repositories - 7 (10.3%)
Allowing own operator groups, group structures and operator tags - 7 (10.3%)
Other (please comment) - 5 (7.4%)
Total Voters: 68

Author Topic: Features for RapidMiner 5.0  (Read 22779 times)
steffen
« Reply #15 on: August 26, 2008, 04:08:05 PM »

WHO DARES TO AWAKE ME FROM MY ETERNAL SLUMBER?

Just kidding. I have not mined a single text yet, so I will be humble and stay quiet. I am thinking about the generic scripting operator...

greetings

Steffen

"I want to make computers do what I mean instead of what I say"
Read The Fantastic Manual
Digital Dude
« Reply #16 on: October 17, 2008, 12:36:17 AM »

Hi Ingo,

It's good to see that you're looking into this.

Maybe OMeta would fit your needs.

Cordially,

-Digital Dude-

It is possible by ingenuity and at the expense of clarity... [to do almost anything in any language].  However, the fact that it is possible to push a pea up a mountain with your nose does not mean that this is a sensible way of getting it there. - Christopher Strachey (NATO Summer School in Programming)-

JCD
« Reply #17 on: October 28, 2008, 12:09:43 PM »

Hi Digital Dude, Hi All,

About OMeta: http://www.cs.ucla.edu/~awarth/ometa/
Is that it?

Well, it may be interesting when different syntaxes are in play, as in crawlers, mashups, or web services, for example, where one can use regexps, of course, but wildcard expressions too (see the small sketch at the end of this post).
I do not know if I have already mentioned these tools:
- For regexps: the "Regex Coach" (http://www.weitz.de/regex-coach/), made in Germany, and it rocks!! And especially for Steffen: it is written in Lisp! It could be integrated into RM as a standalone tool, a bit like the "ANOVA calculator"...
- For large-scale analysis: Hadoop, at http://hadoop.apache.org/core/. I have read that the project is looking at machine learning environments: "Pig" (http://incubator.apache.org/pig/) and "Mahout" (http://lucene.apache.org/mahout/). There may be a licensing issue...
- For efficient crawling: http://lucene.apache.org/nutch/. Fetches, links, and contents are stored separately (thus allowing link analysis and web mining). Massive crawls are split into segments; the architecture relies on Hadoop, and there is a "Nutch scripting language" that could be used to specify a particular crawl from within RM. I switched to that tool after using Websphinx, which was much too slow!! It uses the JDK and Tomcat, but watch out for version compatibility...

The last two tools are written in Java. I wonder whether Web-Harvest can be used together with Nutch; that could give many different combinations for many types of crawls (scrapbooking, search & index, subject crawling, dead-link analysis, sitemap design, etc.).
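
To make the regexp-versus-wildcard point concrete, here is a toy Java sketch (my own, not taken from any of the tools above) that translates a wildcard expression into an equivalent java.util.regex pattern, the way a crawler might accept both syntaxes:

Code:
import java.util.regex.Pattern;

public class WildcardDemo {
    // Translate a simple wildcard expression ('*' and '?') into a regex.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': regex.append(".*"); break;  // any run of characters
                case '?': regex.append('.');  break;  // any single character
                default:  regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = wildcardToRegex("http://*.apache.org/*");
        System.out.println(p.matcher("http://hadoop.apache.org/core/").matches());   // true
        System.out.println(p.matcher("http://www.weitz.de/regex-coach/").matches()); // false
    }
}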

Cheers,
  Jean-Charles.
Digital Dude
« Reply #18 on: October 29, 2008, 12:58:02 AM »

Jean-Charles et al,

That is it.

You may have missed the point of OMeta.

If you are thinking of implementing a scripting language, you may as well implement a meta-scripting language. Then you can have Lisp, regexps, or whatever scripting language of the day for very little additional effort.
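
To illustrate the idea (this is only a toy sketch of my own, not OMeta itself), here is a minimal Java "meta" layer: four parsing combinators on top of which several concrete pattern syntaxes could be defined:

Code:
import java.util.function.Function;

// A parser consumes a prefix of the input and returns the unconsumed rest,
// or null on failure.
interface Parser extends Function<String, String> {}

public class MetaSketch {
    // Match one literal character.
    static Parser ch(char c) {
        return s -> (!s.isEmpty() && s.charAt(0) == c) ? s.substring(1) : null;
    }
    // Sequence: p, then q on what p left over.
    static Parser seq(Parser p, Parser q) {
        return s -> { String r = p.apply(s); return r == null ? null : q.apply(r); };
    }
    // Ordered choice: try p, fall back to q.
    static Parser alt(Parser p, Parser q) {
        return s -> { String r = p.apply(s); return r != null ? r : q.apply(s); };
    }
    // Zero or more repetitions (p must consume input, or this would loop).
    static Parser star(Parser p) {
        return s -> { String r; while ((r = p.apply(s)) != null) s = r; return s; };
    }

    public static void main(String[] args) {
        // "hi" followed by any number of '!' -- one little grammar among the
        // many that could be layered on the same four combinators.
        Parser greeting = seq(ch('h'), seq(ch('i'), star(ch('!'))));
        System.out.println(greeting.apply("hi!!!") != null); // true
        System.out.println(greeting.apply("ho") != null);    // false
    }
}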

Cordially,

-Digital Dude-

"We have very few inferior people in the world.  We have lots of inferior environments.  Try to enrich your environment." -Frank Loyd Wright-
jdouet
« Reply #19 on: October 29, 2008, 06:57:53 AM »

Hi Digital Dude, Hi All,

I definitely recognize the value of OMeta; it is exactly the path I am taking at the moment: practicing pattern matching to understand formal grammars. The point is that I do not know all the scripting languages synthesized with OMeta, but following the earlier exchanges with Steffen: is there a trade-off to be made between the genericity, the "visual simplicity", and the runtime cost of a scripting language?
By "visual simplicity" I mean that RapidMiner suggests a way to "work and see", which is its main value, along with a data format and data management. Is it aimed at matrix or statistical computing? In that case, are there ready-to-use grammar files for matching "R" scripts, for instance?
Genericity is necessary in the preprocessing phases, to cope with data heterogeneity; there, the "meta-matcher" would no doubt be a wonderful Swiss Army knife. But referring to the current poll, the feature desired for the next release is computing speed. What can be said about the performance of a "formal grammar-based" and "object-oriented" scripting language? Is there any benchmark on this aspect?
I know that the FreeMind open-source project has a scripting feature; the point is that they have stability, file-security, and speed issues... And users who contribute new script files are not that frequent!

My suggestion about Hadoop and Nutch was a subtle answer to other posts, where a user complained that the crawler was a bit too tedious and too slow, which I can confirm. Since the RM team is preparing a "chain analysis" plugin, the "linkdb" of Nutch should be very interesting. The MapReduce cluster management and the specific data storage format of Hadoop also seemed interesting to me for the computing speed/load requirement...
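
For readers who have not seen it, the canonical illustration of the MapReduce idea is the word-count job. Here is a minimal sketch against the old org.apache.hadoop.mapred API of that era, adapted from the standard Hadoop example (it is not RM code):

Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Map phase: emit (word, 1) for every token of every input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                out.collect(word, ONE);
            }
        }
    }
    // Reduce phase: sum the counts collected for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            out.collect(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The cluster splits the input among the mappers and moves the computation to the data, which is exactly the speed/load property mentioned above.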

Cheers,
  Jean-Charles.
jdouet
« Reply #20 on: October 29, 2008, 07:05:26 AM »

Just a point,

If OMeta were to be implemented, it could go in the "Meta" operator category, under the name "MetaScripting". Of course.
Here is a good article: http://www.moserware.com/2008/06/ometa-who-what-when-where-why.html

Jean-Charles.
« Last Edit: October 29, 2008, 07:58:22 AM by jdouet »
JCD
« Reply #21 on: October 29, 2008, 10:25:03 AM »

...By the way, I have had a few more ideas in the meantime.

Two more operators related to OMeta (in the "/meta/" category):
- MashupGrammars: takes a list of grammars to import and a set of OMeta rules to mix them (see the previous post for the link to Moserware's blog). For instance, mixing regexp and XPath grammars, or regexps and wildcard expressions, etc...
- SpecializeGrammar: this is the "object-oriented" flavour of OMeta, for instance a regexp grammar specialized to match only numbers and reject letters, or the other way round.

If such operators were to be written, examples would probably be needed, as well as coverage in the tutorial...?

In any case, I am convinced that the same formula works tremendously well: visualize what you have just written, to get a 'closed loop' in which to verify and tune the lines of code. In RM you can switch from the process to the results; in Regex Coach you can match a pattern against a sample while visualizing the grammar tree, all of it on the fly. In operators such as the above, a "show preview" button (and perhaps a wizard) would therefore be needed, wouldn't it?

Of course, from a developer's point of view, this might require creating another type of object (like ExampleSet or ClusterModel) called "SyntaxScript", in which a specialized or mashed-up grammar is named, stored, or loaded. Then every RapidMiner operator using regexps could get a "load grammar file" parameter.
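
A very rough sketch of what such an object might look like (the name SyntaxScript comes from this post; everything else is a plain-Java placeholder, not the actual RapidMiner IOObject API):

Code:
import java.io.Serializable;
import java.util.regex.Pattern;

// Hypothetical object type; a plain regex stands in for a real grammar.
public class SyntaxScript implements Serializable {
    private final String name;       // e.g. "regexp+xpath mashup"
    private final Pattern grammar;   // compiled from a "grammar file"

    public SyntaxScript(String name, String grammarSource) {
        this.name = name;
        this.grammar = Pattern.compile(grammarSource);
    }

    // What every pattern-matching operator would call.
    public boolean matches(String input) {
        return grammar.matcher(input).matches();
    }

    public String getName() { return name; }
}

Each operator with a "load grammar file" parameter would then load one of these and hand it to its matching step.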

Thus the three operators suggested so far, MetaScripting, MashupGrammars, and SpecializeGrammar, should live in /meta/ (why not in "/meta/IO/"?) and be used exclusively for pattern matching, i.e. either web scraping or general data preprocessing, but neither for matrix computations nor for "branch programming"... Said differently, it should not be an OCaml, for instance, but rather a powerful set of I/O and preprocessing operators...

What do you think of that? Tell me if it is desperately useless.
Jean-Charles.
Sebastian Land
« Reply #22 on: October 29, 2008, 10:40:17 AM »

Hi all,
Interesting topic. Metalanguages... fascinating. But where is the connection to data mining? I have the slight feeling we are losing sight of the original target...
I'm already curious about your next ideas.

Greetings,
  Sebastian
JCD
« Reply #23 on: October 29, 2008, 11:30:04 AM »

Quote from: Sebastian Land
Hi all,
Interesting topic. Metalanguages... fascinating. But where is the connection to data mining? I have the slight feeling we are losing sight of the original target...
I'm already curious about your next ideas.

Greetings,
  Sebastian

> Connection to data mining
Two things:
- metadata management / attribute-subset filtering (illustrated below)
- information scraping inside any file

As written in the wiki entry for "Regular Expressions", these are the two use cases for grammars such as regexp or XPath. But there are other pattern-matching grammars: XQuery in Web-Harvest, wildcard expressions in the Websphinx crawler, etc...
These grammars are not equivalent: regexps are complicated but powerful and generic; XPath lets you walk along logical trees and, in my view, handles multiple matches better than regexps; and so on.
Hence the idea of customizing "pattern matching" grammars to tailor the two "connection points" above to specific I/O and preprocessing issues.
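
As a concrete illustration of the first connection point, here is a small Java sketch (the attribute names are invented, and a real operator would work on RapidMiner's attribute objects rather than plain strings):

Code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AttributeFilter {
    // Toy stand-in: keep only the attribute names matching the given regexp.
    static List<String> filter(List<String> attributes, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<String>();
        for (String name : attributes) {
            if (p.matcher(name).matches()) kept.add(name);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> attrs = Arrays.asList("sensor_01", "sensor_02", "label", "id");
        // Prints [sensor_01, sensor_02]
        System.out.println(filter(attrs, "sensor_\\d+"));
    }
}

A grammar object as suggested above could replace the bare regexp here.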

So this type of scripting language would not be dedicated to the core data-mining computations, as one would expect of "R", for instance. I mentioned Regex Coach being written in Lisp, but that may be a bit confusing, since there is no Lisp to write at all; the focus is solely on the regexps to be designed.

Cheers,
  Jean-Charles.
JCD
« Reply #24 on: October 29, 2008, 11:39:12 AM »

Another point (!)...

For AttributeSubsetProcessing, there should be a "show preview" button next to the "regexp" field. When there are zillions of attributes to filter, such pattern-matching functions become critical and useful...

Cheers,
  Jean-Charles.
damiano
« Reply #25 on: October 07, 2009, 10:14:50 AM »

Hello guys,
I think a useful feature would be to integrate the RapidMiner API with Maven, the way Weka is integrated: http://wwmm.ch.cam.ac.uk/maven2/weka/
I know it is a cross-cutting feature, but nowadays Maven is a standard in J2EE projects.

Thanks!
Sebastian Land
« Reply #26 on: October 08, 2009, 08:58:04 AM »

Hi,
I'm quite unfamiliar with Maven, and the link does not describe what I could do with it at all. As far as I know, Maven is a build tool like Ant. So I'm a bit confused: what would I gain by integrating a data mining API into a build tool?

Greetings,
  Sebastian
damiano

Hi,
from maven.apache.org: "Apache Maven is a software project management and comprehension tool".

By declaring the data mining API as a Maven dependency, it could be used as an external library transparently, without manually importing all of its jars (rapidminer.jar and all the dependent external libraries).
You can find more information at http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html.
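
Concretely, if such an artifact were published (the coordinates below are purely hypothetical, since RapidMiner is not in any public Maven repository that I know of), a project would only need to declare:

Code:
<!-- Hypothetical coordinates, for illustration only. -->
<dependency>
    <groupId>com.rapidminer</groupId>
    <artifactId>rapidminer</artifactId>
    <version>5.0.0</version>
</dependency>

Maven would then fetch rapidminer.jar and its transitive dependencies automatically.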

Thanks for replying.

Sincerely,
Damiano.

Logged
steffen
« Reply #28 on: November 10, 2009, 11:12:04 AM »

Jumping in ...

I want to add: it eases dependency management (in my view the most important feature); the build itself can be performed with Maven or with Ant (there is, as far as I know, also an Ant plugin for Maven).

If you have 1-2 hours, I recommend the first two chapters of Better Builds with Maven. The first explains in detail what Maven is for; the second contains sufficient information for daily usage.

regards,

Steffen
« Last Edit: November 10, 2009, 11:17:39 AM by steffen »

"I want to make computers do what I mean instead of what I say"
Read The Fantastic Manual
sgtrock
« Reply #29 on: December 24, 2009, 04:24:56 PM »

Hi, all;

I've skimmed the thread and didn't see my issue raised.  I am concerned about using the update process to search for plug-ins as it has some security implications for my company. 

I was prepared to talk about it in some detail when I realized that posts I had made on one of your competitors' sites concerning forced registration covered all of the same issues.  With that in mind, here is the relevant text from the two posts:

Quote
I am a senior IT architect for a large financial services company. We have roughly 65,000 employees. My conservative estimate is that this tool looks like it could be quite useful for 5% to 10% of our staff.

Unfortunately, I cannot recommend this kind of tool if it requires external authentication to run. Further, I cannot recommend such a tool for use here if it requires the establishment of a connection through our firewalls. The security folks would have my head on a platter, and rightly so.

If I cannot run your software without registering, can someone please explain the reasoning behind requiring registration?

Quote
I am not willing to publish the name of my company on a forum visible to all. If you're interested in contacting me for more information, you have my email address from my registration information.

However, I will note that the financial services company that I work for is based in the U.S. and thus falls under the regulatory and auditing scrutiny of a whole host of Federal government and other agencies. Off the top of my head, I can think of:

      OCC
      SEC
      FINRA
      FDIC
      Federal Reserve Board
      PCI
      BSA



Each one of these organizations (and several others!) sends its auditors crawling through our financial records, computer systems, and business practices every single year. Last year's financial meltdown has motivated these auditors to become far more aggressive in how thoroughly they scrutinize everything that we do. (Rightfully so, in my opinion.)

Now that I've explained the regulatory environment that we face, let me address why registration is such an issue. With respect, requiring online registration of individual copies of software for simple use places it outside the realm of software that we can use because it violates our security policies. Not even Microsoft is allowed to sell us software that "phones home."

The reason behind this statement in our security policies is quite simple. There is no feasible, cost effective way for us to determine which connections are supplying just registration information and which ones are supplying far more detail about our computing environment. This information is regarded as confidential because knowing someone's hardware and software mix can be leveraged to reduce the time necessary to hack targeted systems.

Worse, the existence of such a connection could theoretically be used to delve into any information that may be stored locally. Since the services that we provide require that we have a deep and intimate view of our customers' confidential financial information, we are ethically, morally, and legally required to make every effort to avoid even the merest suggestion of a possible leak.

There is no way that we could implement software with this kind of "phone home" requirement without drawing the ire of an auditor from one of the regulatory agencies. That is what makes your software, regardless of how attractive I personally think it might be for my company, off limits for us and every other financial services company in the U.S.

I know that the health care industry faces even more scrutiny than financial services does, so my guess is that they have similar requirements for protecting confidential patient information. That locks you out of two very large pools of potential customers for your services.

With all that said, I think that an optional registration similar to that used by OpenOffice would probably pass muster.

If you wish to discuss this off line, please don't hesitate to contact me via email. I am willing to continue the debate here if you would prefer. Frankly, I think this conversation is a healthy one and should stay public as long as we can keep this relatively anonymous.

=== end quotes ===

The need to control our computing environment in order to meet our regulatory obligations requires us to maintain full control over what is deployed on our end users' PCs.  Our IT department must be the sole source of all software management.  We must be able to deploy software on our schedule /and/ be able to roll back if and when we choose to. 

I can tell you that I have personal knowledge of two vendors that were rejected this year because they refused to give us this capability.

However, I recognize that plug-ins are a great way to introduce new functionality for a minimal cost.  The good news is that in general we make a distinction between plug-ins that provide that additional functionality and the primary software executables.  This is especially true when we can mitigate the risks in one or more ways.

The first would be for us to provide a "gold" repository of plug-ins that have been vetted by our Information Security department.  This is by far our preferred method.  Is there a simple configuration-file change that would allow us to force our users to go to such a repository rather than back to yours?

The second would be for us to put your company through a security and financial audit to verify that the security around your repository is such that the potential for malware to creep in is minimal.  That's a path that I'm reluctant to take because as you might imagine, it can be time consuming and therefore expensive to complete.

The third path is not one that I think is of much value to our end users.  That would be to block access to your plug-in repository and not allow any to be installed.  I have a sneaking suspicion that at best, such a situation would create some unhappiness and friction between our IT department and the people actually using your software on a day to day basis.

Any thoughts on how we might solve this conundrum?
« Last Edit: December 24, 2009, 05:31:52 PM by sgtrock »