Author Topic: Text Mining - FP Growth stucks  (Read 1568 times)
jansudes
Newbie
*
Posts: 4


« on: September 17, 2013, 01:13:43 PM »

Hello,

I am quite new to RapidMiner, and to practice on my own I decided to work on 7 .txt files containing 42,000 lines (or approximately 70,000 characters) in total.
My first intention was to make an association analysis based on these texts.

The steps that I followed were:
1-Process the data (tokenize, filter...)
2-Numerical to Binominal
3-FP-Growth
4-Create Association rules
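
For readers following along outside RapidMiner, steps 1 and 2 can be sketched in plain Python. This is only an illustration, not what the RapidMiner operators do internally: it assumes each line of text is one transaction, and the helper names `tokenize` and `to_binominal` are made up for this example. The point it shows is that the true/false table gets one column per distinct word, so its width grows with the vocabulary.

```python
import re

def tokenize(line):
    """Lower-case a line and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", line.lower())

def to_binominal(lines):
    """Return (vocabulary, rows); each row holds one True/False flag per word."""
    # Assumption: one line of text is treated as one transaction.
    transactions = [set(tokenize(line)) for line in lines]
    vocabulary = sorted(set().union(*transactions))
    rows = [[word in t for word in vocabulary] for t in transactions]
    return vocabulary, rows

lines = ["The cat sat", "the dog sat down"]
vocabulary, rows = to_binominal(lines)
print(vocabulary)
print(rows)
```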

However, the process gets stuck at FP-Growth: it runs for almost an hour and then fails with an insufficient-memory error.
I have no idea what might be causing this.

I would appreciate help from someone with more experience.

Thank you.
Jansu
haddock
Hero Member
*****
Posts: 853



« Reply #1 on: September 18, 2013, 10:07:11 AM »

Hi there,

I spent some time on RM Association Rules and have posted on this subject before, so it is worth searching the forum for those earlier posts.

Being practical I would...

1. Put a breakpoint before the FP-Growth operator runs, just to check that all your attributes are Binominal and in good order, with no missing values and so on.

2. Start with a frequency threshold that is so high that it produces no itemsets, and then lower that threshold. If you go too low you will fill up the memory with loads of itemsets.

3. Check the source code of the Association Rules operator. If it still iterates over the powerset of each itemset, then give up on this approach unless your itemsets are short! I posted about this a while ago, and Ingo acknowledged the weakness.
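
Points 2 and 3 can be tried out on a toy example. The sketch below is a brute-force counter in plain Python, not FP-Growth and not RapidMiner code; the function names, the `budget` parameter and the sample transactions are all invented for illustration. It starts from a threshold high enough to return nothing, lowers it step by step, and stops before the number of itemsets exceeds the budget. `rule_candidates` counts the antecedent/consequent splits a powerset-style rule generator would have to visit for a single itemset.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=3):
    """Brute-force itemset counting (not FP-Growth), up to max_len items."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    found = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                found[candidate] = support
    return found

def lower_until(transactions, start=1.0, step=0.25, floor=0.25, budget=10):
    """Lower the threshold from `start`; stop before exceeding `budget` itemsets."""
    threshold, best = start, (start, {})
    while threshold >= floor:
        found = frequent_itemsets(transactions, threshold)
        if len(found) > budget:
            break
        best = (threshold, found)
        threshold -= step
    return best

def rule_candidates(itemset_size):
    """Non-empty proper subsets of one itemset, i.e. possible rule antecedents."""
    return 2 ** itemset_size - 2

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
threshold, itemsets = lower_until(transactions, budget=5)
print(threshold, len(itemsets))
print(rule_candidates(20))
```

Even in this tiny example the count of itemsets roughly doubles with each step down in threshold, and a single 20-item itemset already yields over a million candidate rules.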

Best wishes,

H

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T.S.Eliot ~ Choruses from the Rock 1934
MH
Newbie
*
Posts: 3


« Reply #2 on: July 03, 2014, 10:12:39 PM »

In addition, reducing the max items helps reduce run time.
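
A rough way to see why, assuming "max items" caps the length of an itemset: the number of possible itemsets over n distinct words, limited to at most k words each, is the sum of the binomial coefficients C(n, 1) to C(n, k). The helper name `candidate_count` is made up for this sketch, and the figures are an upper bound on the search space, not what FP-Growth actually enumerates.

```python
from math import comb

def candidate_count(n_items, max_items):
    """Upper bound on distinct itemsets with between 1 and max_items items."""
    return sum(comb(n_items, k) for k in range(1, max_items + 1))

# A vocabulary of 1000 distinct words, with different caps on itemset length.
for cap in (1, 2, 3):
    print(cap, candidate_count(1000, cap))
```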
haddock
Hero Member
*****
Posts: 853



« Reply #3 on: July 04, 2014, 09:59:30 AM »

Hi there MH,

That's certainly true; it's also worth noting that, for English at least, just 100 connective words, which carry little significance, constitute about half of normal text. In my current work, which mines association rules from newsfeeds ( http://datamonkees.wpengine.com ) this tip really helped. It also appears to work in Spanish and French, so it's worth paying attention to your stopword list in the pre-processing phase.
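
As a small sketch of that pre-processing step in plain Python: the stopword set below is a tiny hypothetical stand-in, not the 100-word list mentioned in this post, and the function names are invented for the example. It filters the tokens and reports what share of the text the stopwords accounted for.

```python
def drop_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

def share_removed(tokens, stopwords):
    """Fraction of all tokens that are stopwords."""
    return sum(1 for t in tokens if t in stopwords) / len(tokens)

# Hypothetical mini stopword list, for illustration only.
stopwords = {"the", "and"}
tokens = "the cat and the dog and the bird".split()

print(drop_stopwords(tokens, stopwords))
print(share_removed(tokens, stopwords))
```

Fewer distinct high-frequency words going into FP-Growth means fewer frequent itemsets that consist only of connectives.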

Best wishes

H

PS: The list of 100 English words can be found here: http://datamonkees.wpengine.com/2014/03/the-associator/