Good afternoon everyone.
I'm very new to text mining and to RapidMiner.
I used the text plug-in for RapidMiner and I have a few questions for anyone kind enough to answer.
1) When I process a group of texts I get a big matrix where the items (documents) are the rows and the features (the stems) are the columns. What metric fills each cell? I want to be sure about the meaning of the number inside each cell. Can I change it, and if so, how?
2) If more than one metric is available, which is the most suitable for cluster analysis on this matrix? (I'm simply asking for an opinion.)
3) The stemmer and the tokenizer divide my text into single words (if the text is "always happy or sad", I get the stems corresponding to always, happy, and so on).
Is it possible in RM to work on groups of words rather than single words? In medical and scientific text I very often have terms such as "acetic anhydride" that should be treated as a single token.
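
To make questions 1 and 3 concrete, here is a small Python sketch of what I have in mind. It is not RapidMiner code, and I don't know whether the plug-in computes anything like this. The phrase list, the function names and the second example sentence are all made up by me for illustration. It keeps a known multi-word term together as one token and then fills the document-by-term matrix in two possible ways: raw term counts, and counts weighted by log(N / document frequency), which is one common form of TF-IDF.

```python
import math
from collections import Counter

# Hypothetical lexicon of multi-word terms that should stay together.
PHRASES = ["acetic anhydride"]


def tokenize(text, phrases=PHRASES):
    """Lowercase, join each known phrase with '_', then split on whitespace."""
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text.split()


def term_frequency_matrix(docs):
    """Rows are documents, columns are terms, cells are raw counts."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def tfidf_matrix(docs):
    """Same layout, but each count is scaled by log(N / document frequency)."""
    vocab, tf = term_frequency_matrix(docs)
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    weighted = [
        [row[j] * math.log(n / df[j]) for j in range(len(vocab))]
        for row in tf
    ]
    return vocab, weighted


docs = ["always happy or sad", "acetic anhydride is always reactive"]
vocab, tf = term_frequency_matrix(docs)
_, tfidf = tfidf_matrix(docs)

print(vocab)
for row in tf:
    print(row)
```

With the count version, a cell simply says how many times the term occurs in the document. With the weighted version, a term that occurs in every document (here "always") gets weight 0, so it no longer influences the distance between documents. What I would like to know is which of these (or which other metric) RM actually uses.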
I apologize for being so verbose.
Thanks for your kind attention. I hope someone can help.