For each category I have close to 100 examples. BTW, what is the ideal number of examples? I'm only working on the abstract section of the documents.
We are talking about a statistical problem here. Let me give you another example: you are given a six-sided die and have to decide whether it is fair or not. How many times do you have to throw it to tell? (Wikipedia: Statistical hypothesis test). In your case of 15 classes, the interesting question is which performance you have to achieve to be better than random guessing (1/15). I cannot cover this topic here, but there is a lot of statistical literature out there for calculating all these numbers (e.g. number of examples per category, minimum performance, etc.).
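To make the "better than random" question concrete, here is a small self-contained sketch (plain Python, all numbers hypothetical): a one-sided binomial test that asks how likely a pure guesser with per-example success probability 1/15 would be to score at least as well as your classifier did on the test set.

```python
from math import comb

def p_better_than_random(correct, total, p_random):
    """One-sided binomial test: probability of seeing at least
    `correct` hits out of `total` examples if the classifier were
    guessing with success probability `p_random` per example."""
    return sum(comb(total, k) * p_random**k * (1 - p_random)**(total - k)
               for k in range(correct, total + 1))

# Hypothetical numbers: 15 classes -> random baseline 1/15;
# suppose the model classifies 30 of 150 test abstracts correctly.
p = p_better_than_random(30, 150, 1/15)
print(f"p-value = {p:.4g}")  # a small p-value => better than random
```

A guesser would expect about 10 of 150 correct here, so 30 correct yields a tiny p-value; with fewer test examples the same accuracy can easily be statistically indistinguishable from guessing, which is exactly why the sample-size question matters.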
RapidMiner offers the standard t-test ... but before we start testing, let's see if we can achieve some improvements at all.
But is it possible to do hierarchical categorization in RapidMiner?
Like Haddock once said (oh, I should add this one to my signature), "RapidMiner is like Lego". You can achieve nearly anything with the right combination of operators. I will give you some hints:
- AttributeConstruction in combination with ChangeAttributeRole or ExchangeAttributeRoles to aggregate labels
- ProcessBranch to realize an if-else statement
- ValueIterator allows you to iterate over the values of your label attribute
- ProcessLog to log the performance
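The operator combination above amounts to a two-stage process: first predict a super-category (the aggregated label), then branch to a specialist model for the fine-grained class. The Python sketch below uses trivial keyword rules as stand-ins for learned models; all class names, the merge map, and the rules are invented for illustration.

```python
# Two-stage (hierarchical) classification sketch. In RapidMiner the
# classifiers would be learned models wired together with operators
# like ProcessBranch and ValueIterator; here they are simple rules.

SUPER_OF = {               # hypothetical merge of fine classes
    "physics": "science", "biology": "science",
    "poetry": "arts", "painting": "arts",
}

def coarse_classifier(text):
    # First stage: decide the super-category (aggregated label).
    return "science" if "experiment" in text else "arts"

FINE_CLASSIFIERS = {
    # Second stage: one specialist classifier per super-category.
    "science": lambda t: "physics" if "quantum" in t else "biology",
    "arts":    lambda t: "poetry" if "verse" in t else "painting",
}

def classify(text):
    branch = coarse_classifier(text)       # if-else branch on stage 1
    return FINE_CLASSIFIERS[branch](text)  # delegate to specialist

print(classify("a quantum experiment"))   # physics
print(classify("free verse"))             # poetry
```

The design point is that each specialist only has to separate a few similar classes, which is often easier than one flat 15-way decision.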
It is quite hard to create an automatic process that finds the optimal merge of categories for your problem. Indeed, setting one up would take an hour or more even for an experienced user, so I suggest that you try manual combinations (including the domain knowledge you have) to get a better feeling for which classes to merge. Please understand that I cannot provide a complete process here. Play around, and I guarantee that you will appreciate RapidMiner more and more.
Last question: What exactly does the "attribute weight" do? From what I understand, you apply the attribute weights to an example set to change the values of the attributes. What else is it used for?
The AttributeWeight is an indication of how important an attribute is for distinguishing the classes. In the case of FeatureSelection it is always 1 or 0 (use it or don't); other operators (like InformationGainWeighting) provide a less crisp evaluation. Use the operator AttributeWeightSelection to filter the attributes and remove redundant or (worse) disturbing information.
As I said above, the optimal feature set may well depend on the current "merge situation" of your categories.
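Conceptually, weight-based selection boils down to something like the sketch below: keep only attributes whose weight passes a threshold, and optionally scale the surviving values by their weight. All attribute names, weights, and the threshold are invented; in RapidMiner the AttributeWeightSelection operator handles this for you.

```python
# Illustrative attribute weights, e.g. from information gain weighting
weights = {"word_gene": 0.82, "word_cell": 0.55,
           "word_the": 0.01, "word_data": 0.30}

# One example (a document's term counts), attribute -> value
example = {"word_gene": 3, "word_cell": 1, "word_the": 12, "word_data": 2}

def select_and_scale(example, weights, threshold=0.25):
    """Drop attributes below the weight threshold and scale the rest."""
    return {a: v * weights[a]
            for a, v in example.items()
            if weights.get(a, 0.0) >= threshold}

print(select_and_scale(example, weights))
# "word_the" is dropped as near-irrelevant; the rest are weighted
```

Filtering out low-weight attributes like the stopword count above shrinks the feature space and can remove noise that actively hurts the classifier.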
I wish you success!
PS: If it won't work, try this: http://www.youtube.com/watch?v=egfCXLHfw-M
(cannot get rid of this song)