Topic: Dictionaries and stemming
Lorenzo
« on: June 11, 2008, 11:38:25 AM »

Hi everyone!
I'm currently using the text plug-in and would like to clarify some of its behavior. I'm not using the DictionaryStemmer block; I'm working only with the English stopword filter, the tokenizer, and the Porter stemmer.

My understanding is:
1) My text is filtered against a set of English stop words, and matching words are pruned (e.g. "and", "or").
    - I work with biology texts, so I'm wondering what happens to unusual terms such as IL-6. Are they filtered out or kept?
2) The stemmer keeps only the "basic chunks" (stems) of my words. I think this is based on a dictionary.
    - Could you tell me which dictionary it is? I need to know precisely, because whether it contains medical terms such as glicolase is crucial for me right now.
    - What happens to my unusual terms (e.g. IL-6)? Are they pruned, chunked in some way, or kept as they are?

Thanks for your kind attention. Hope that someone can help!
Lorenzo

Tobias Malbrecht
Global Moderator
« Reply #1 on: June 13, 2008, 10:17:49 AM »

Hi Lorenzo,

Unfortunately, I am not that familiar with the text plugin, so I cannot answer your questions immediately. I will try to get that information from the developer of the text plugin, but this might take a while. I will post again as soon as I have it.

Regards,
Tobias

Tobias Malbrecht
Director of Product Marketing
RapidMiner
Lorenzo
« Reply #2 on: June 13, 2008, 02:02:55 PM »

Thank you very much for your kind attention and for your availability.
Ingo Mierswa
Administrator
« Reply #3 on: June 24, 2008, 06:59:12 PM »

Hello,

Quote
1) My text is filtered against a set of English stop words, and matching words are pruned (e.g. "and", "or").
    - I work with biology texts, so I'm wondering what happens to unusual terms such as IL-6. Are they filtered out or kept?

As far as I know they should be maintained.
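
To illustrate why a term like IL-6 would normally pass a stop-word filter, here is a minimal Python sketch. It is not the text plug-in's code: the stop-word list is a tiny hypothetical one, and it assumes the token reaches the filter intact, which depends on the tokenizer used before it.

```python
# Hypothetical, very small stop-word list; the plug-in's real list is not shown in this thread.
STOP_WORDS = {"and", "or", "the", "of", "in", "is"}


def filter_stop_words(tokens):
    """Drop every token whose lowercase form appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Assumes the tokenizer delivered "IL-6" as a single token.
tokens = ["IL-6", "and", "TNF", "is", "elevated"]
kept = filter_stop_words(tokens)
print(kept)
```

A filter of this kind only removes exact matches against its list, so a domain term survives unless someone adds it to the list.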

Quote
2) The stemmer keeps only the "basic chunks" (stems) of my words. I think this is based on a dictionary.
    - Could you tell me which dictionary it is? I need to know precisely, because whether it contains medical terms such as glicolase is crucial for me right now.
    - What happens to my unusual terms (e.g. IL-6)? Are they pruned, chunked in some way, or kept as they are?

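Stemming is not based on a dictionary but on a stemming algorithm: the Porter stemmer strips suffixes by rule, so there is no word list that would or would not contain your medical terms. Here you can find a description of the algorithm: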

http://tartarus.org/~martin/PorterStemmer/def.txt
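
As a rough illustration of what "algorithm, not dictionary" means, the Python sketch below applies only step 1a of the Porter algorithm (the plural-suffix rules SSES -> SS, IES -> I, SS -> SS, S -> nothing). It is a deliberately partial sketch, not the plug-in's implementation, and the function name is made up for this example; the full algorithm has several more steps.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter algorithm (plural suffixes).

    The rules are tried in order and the first matching suffix wins.
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word            # no rule applies: the word is left unchanged


for w in ["caresses", "ponies", "caress", "cats", "kinases", "IL-6"]:
    print(w, "->", porter_step_1a(w))
```

The rules look only at the characters of the word, so they apply equally to everyday and technical vocabulary, and a string that matches no rule passes through untouched.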

Cheers,
Ingo
mjw
Guest
« Reply #4 on: July 06, 2008, 01:23:48 PM »

Hi,

There are several steps to word vector creation. The first is tokenizing. The simple tokenizer in RM discards every character that is not recognized as a letter and uses it as a split point. So I am afraid your fancy words will not survive this step (IL-6, for instance, would presumably be cut at the hyphen and lose its digit). You can either:

1. Use the Feature Extractor to extract your fancy words explicitly
2. Implement a custom tokenizer (the API is really easy to implement against)
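
The difference between the two approaches can be sketched in Python. This is not RapidMiner's tokenizer API: both functions and their regular expressions are assumptions made for illustration, with the first mimicking the "split on anything that is not a letter" behavior described above and the second showing one possible custom rule.

```python
import re


def simple_tokenize(text):
    """Split on every run of non-letter characters and drop empty pieces."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]


def custom_tokenize(text):
    """Hypothetical custom rule: keep letters, digits and inner hyphens together."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


text = "IL-6 and TNF-alpha levels rose."
simple = simple_tokenize(text)
custom = custom_tokenize(text)
print(simple)
print(custom)
```

With the letter-only rule, "IL-6" is reduced to "IL" and "TNF-alpha" falls apart into two tokens; the custom rule keeps both terms whole, at the cost of also keeping any other hyphenated or numeric strings.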

Instead of (or in combination with) the Porter Stemmer, you can provide your own dictionary containing regular expressions.
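
A dictionary of regular expressions could work roughly as in the Python sketch below. The dictionary format, the patterns, and the function name are all hypothetical; the plug-in's actual dictionary file format is not described in this thread.

```python
import re

# Hypothetical dictionary: each pattern maps every matching token to one canonical term.
STEM_DICTIONARY = [
    (re.compile(r"^(il|interleukin)-?6$", re.IGNORECASE), "IL-6"),
    (re.compile(r"^glycolys[ei]s$", re.IGNORECASE), "glycolysis"),
]


def dictionary_stem(token):
    """Return the canonical term of the first matching pattern, else the token itself."""
    for pattern, canonical in STEM_DICTIONARY:
        if pattern.match(token):
            return canonical
    return token


for t in ["interleukin-6", "IL6", "Glycolyses", "cells"]:
    print(t, "->", dictionary_stem(t))
```

Tokens that match no pattern are returned unchanged, so they could still be passed to the Porter stemmer afterwards if the two are combined.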

Best regards,
Michael