1) My text is filtered against a set of English stop words, and some words are pruned (e.g. and, or, ...).
- I have to work with texts on biology, so I'm wondering what happens to unusual words such as IL-6. Are these words filtered out or kept?
As far as I know, they should be kept.
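To illustrate why a term like IL-6 would normally survive, here is a minimal Python sketch of a stop-word filter. It is not the tool's actual implementation: the `STOP_WORDS` set and the `remove_stop_words` function are hypothetical names, the list is deliberately tiny, and the text is assumed to be split on whitespace only.

```python
# Hypothetical, deliberately tiny stop-word list; a real tool ships a longer one.
STOP_WORDS = {"and", "or", "the", "of", "is", "are", "a"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in STOP_WORDS; keep everything else."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Assumption: tokens come from a plain whitespace split.
tokens = "IL-6 and TNF are markers of inflammation".split()
print(remove_stop_words(tokens))  # ['IL-6', 'TNF', 'markers', 'inflammation']
```

A filter of this kind removes only words that appear in its list, so IL-6 passes through. Note that the tokenizer matters too: one that splits on non-letter characters could break IL-6 into "IL" and "6" before the filter ever sees it, so it is worth checking how your tool tokenizes.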
2) The stemmer keeps only the "basic chunks" of my words. I think this is based on a dictionary.
- Could you tell me which dictionary that is? I need to know precisely in order to answer the question "does it contain medical terms such as glicolase?", which is crucial for me right now.
- What happens to my unusual words (e.g. IL-6)? Are they pruned, chunked in some way, or kept as they are?
The stemming is not based on a dictionary but on a stemming algorithm. You can find a description of the algorithm here: http://tartarus.org/~martin/PorterStemmer/def.txt
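To show what "algorithm, not dictionary" means in practice, here is a minimal Python sketch of just Step 1a of the Porter algorithm (the plural-suffix rules SSES -> SS, IES -> I, SS -> SS, S -> nothing) as given in the description linked above. It is only one step, not the full stemmer, and the function name `porter_step_1a` is my own; input is assumed to be already lowercased.

```python
def porter_step_1a(word):
    """Apply only Step 1a of the Porter algorithm (plural suffixes)."""
    # Rules are tried longest suffix first; the first match wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # No rule matched: the word is returned unchanged.
    return word


for w in ["caresses", "ponies", "caress", "cats", "glicolases", "il-6"]:
    print(w, "->", porter_step_1a(w))
```

The rules look only at the letters of the word, so an unknown term such as "glicolases" is handled the same way as "cats", with no lookup anywhere. A token like "il-6" matches none of these suffix rules and leaves this step unchanged; what the complete stemmer in your tool does with it also depends on how the text is tokenized beforehand.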