Author Topic: Scan index files of books for important terms  (Read 466 times)
DaCapitalist


« on: March 11, 2011, 11:41:19 AM »

Hi there!

I'm not sure if this is the right forum to post this problem, but I hope you guys can help me.

The scenario is: we have a lot of index files in RTF format, like the glossaries at the end of an academic book.
We want to analyze which words and expressions occur most often and are therefore the most important in this field of study.

I know that it is easy with RapidMiner to count all tokens in these files, but often the expressions are a combination of two or even more words, which you can only detect if you look at the text layout, like:

user 154-167
    behaviour 178-190
    goal 32-38
    ....

You get what I mean? I'm not sure whether this problem is solvable with RapidMiner, and in particular I don't know HOW. Can you help me with some advice, either on RapidMiner or on another tool which can help me with that?
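To make the layout problem concrete, here is a minimal Python sketch of the idea, not a RapidMiner process. It assumes the RTF files have already been converted to plain text, that sub-entries are indented under their head term as in the example above, and that page references sit at the end of each line. The names `PAGE_REFS`, `index_expressions` and `count_expressions` are made up for this illustration.

```python
import re
from collections import Counter

# Trailing page references such as "154-167" or "12, 40-42" (assumed format).
PAGE_REFS = re.compile(
    r"[\s,]*\d+(?:\s*[-\u2013]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-\u2013]\s*\d+)?)*\s*$"
)

def index_expressions(text):
    """Yield one expression per index entry, prefixing sub-entries with their head terms."""
    stack = []  # (indent, term) pairs for the head terms that are currently open
    for raw in text.splitlines():
        line = raw.expandtabs(4)
        term = PAGE_REFS.sub("", line).strip()
        if not any(c.isalnum() for c in term):
            continue  # skip blank lines and fillers such as "...."
        indent = len(line) - len(line.lstrip())
        # Close every entry that is indented at least as deeply as this one.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, term))
        yield " ".join(t for _, t in stack).lower()

def count_expressions(texts):
    """Count how often each expression occurs across several index files."""
    counts = Counter()
    for text in texts:
        counts.update(index_expressions(text))
    return counts

sample = "user 154-167\n    behaviour 178-190\n    goal 32-38\n    ....\n"
print(list(index_expressions(sample)))
```

On the example above this produces "user", "user behaviour" and "user goal". Joining head term and sub-entry in that order is a guess; real indexes often invert terms, and a term that itself ends in a number would lose it to the page-reference pattern, so the result needs checking against the actual files.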

Thank you very much!
DaC
DaCapitalist


« Reply #1 on: April 08, 2011, 11:58:17 AM »

No idea for this? Anyone?
Thought this should be possible with RapidMiner...
Ingo Mierswa
Administrator
« Reply #2 on: April 08, 2011, 06:30:47 PM »

Hi,

it is - but does this help you? We have done something very similar to this, and it involved a heavy load of information extraction from the structured file information, which can be a real pain if there is a lot of layout information. So if you want me to actually show you an out-of-the-box process doing this: I have a price tag stuck to my back somewhere ;-)

Seriously, this might turn out to be a hard task, depending on the set of files you are analyzing and how different they are. You can actually learn those dependencies (we had a master's thesis about that at my former department), but this can quickly become a multi-month project. So if you are interested (we certainly are), please contact Rapid-I directly.

Sorry for not having better news,
Ingo