Topic: Determine text similarity?
Bill
Guest
« on: March 26, 2009, 05:41:03 PM »

Hi,

is it possible to use RapidMiner to determine the similarity of two texts (e.g. using cosine similarity)?

I played around with RapidMiner and the text plugin. I managed to create word vectors using TextInput and applied StringTokenizer, EnglishStopwordFilter and PorterStemmer.

But now I'm stuck. How can I compare two text files and determine their similarity?

I'm thankful for any hint!
Ingo Mierswa
« Reply #1 on: March 26, 2009, 06:52:24 PM »

Hi,

have you tried the operator ExampleSet2Similarity? If you search for "similarity" in the field below the operator groups in the "New Operator" tab, or in the text field of the "New Operator" dialog, this operator (and other similarity-related ones) should come up.
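
For reference, the cosine similarity asked about above can be sketched outside RapidMiner in a few lines of Python. This is only a minimal illustration, not what ExampleSet2Similarity does internally: the helper names `term_counts` and `cosine_similarity` are made up for this example, the tokenizer is a crude regular expression, and raw term counts are used instead of whatever weighting the text plugin applies.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphabetic tokens (a crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc1 = term_counts("Patient data records need a trusted source.")
doc2 = term_counts("The source of patient data must be trusted.")
print(cosine_similarity(doc1, doc2))
```

A value near 1 means the two texts use the same terms in similar proportions; 0 means they share no terms at all.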

Cheers,
Ingo

Did you try our new Marketplace? Upload or download new Extensions, add comments, and organize your operators. Have a look at  http://marketplace.rapid-i.com
Bill
Guest
« Reply #2 on: March 27, 2009, 09:15:45 AM »

Hi,
thanks for your answer.

I tried that, but it gives me an illegal argument exception: null or zero length argument @ ExampleSet2Similarity

However, within DataStatistics, I get output which looks like this:
Quote
#1: authorit (real/single_value): avg = 0.09033030954895138 +/- 0.0; unknown = 0.0
#2: sourc (real/single_value): avg = 0.30110103182983794 +/- 0.0; unknown = 0.0
#3: data (real/single_value): avg = 0.5419818572937083 +/- 0.0; unknown = 0.0
#4: alia (real/single_value): avg = 0.030110103182983794 +/- 0.0; unknown = 0.0
#5: system (real/single_value): avg = 0.030110103182983794 +/- 0.0; unknown = 0.0
#6: record (real/single_value): avg = 0.06022020636596759 +/- 0.0; unknown = 0.0
#7: trust (real/single_value): avg = 0.06022020636596759 +/- 0.0; unknown = 0.0
#8: motiv (real/single_value): avg = 0.030110103182983794 +/- 0.0; unknown = 0.0
#9: patient (real/single_value): avg = 0.030110103182983794 +/- 0.0; unknown = 0.0
#10: heath (real/single_value): avg = 0.030110103182983794 +/- 0.0; unknown = 0.0

My process looks like this:
Ingo Mierswa
« Reply #3 on: March 31, 2009, 11:19:45 AM »

Hello,

hmm, I just tried that myself with a data set delivered together with the Text plugin and everything seems to work normally. Here is the process:

Code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="TextInput" class="TextInput" expanded="yes">
        <list key="texts">
          <parameter key="graphics" value="../data/newsgroup/graphics"/>
          <parameter key="hardware" value="../data/newsgroup/hardware"/>
        </list>
        <parameter key="default_content_encoding" value="ISO-8859-1"/>
        <parameter key="prune_below" value="2"/>
        <list key="namespaces">
        </list>
        <parameter key="create_text_visualizer" value="true"/>
        <parameter key="on_the_fly_pruning" value="3"/>
        <operator name="StringTokenizer" class="StringTokenizer">
        </operator>
        <operator name="EnglishStopwordFilter" class="EnglishStopwordFilter">
        </operator>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="ToLowerCaseConverter" class="ToLowerCaseConverter">
        </operator>
        <operator name="TermNGramGenerator" class="TermNGramGenerator">
        </operator>
    </operator>
    <operator name="ExampleSet2Similarity" class="ExampleSet2Similarity">
    </operator>
</operator>
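
To make the operator chain inside TextInput easier to follow, here is a rough Python sketch of the same sequence of steps (tokenize, drop stopwords, drop short tokens, lowercase, add n-grams). It only mirrors the order of the operators in the process above; it is not the plugin's implementation. The function name `preprocess`, the tiny stopword list, the case-insensitive stopword check, the n-gram length of 2, and the underscore used to join n-grams are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; the plugin ships its own.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}


def preprocess(text, min_chars=3, max_ngram=2):
    """Mimic the operator order of the process: returns a list of terms."""
    # StringTokenizer: split on anything that is not a letter (assumption).
    tokens = re.findall(r"[A-Za-z]+", text)
    # EnglishStopwordFilter: drop common words (case-insensitive here).
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    # TokenLengthFilter with min_chars=3: drop tokens shorter than 3 letters.
    tokens = [t for t in tokens if len(t) >= min_chars]
    # ToLowerCaseConverter.
    tokens = [t.lower() for t in tokens]
    # TermNGramGenerator: append n-grams of adjacent tokens (length assumed).
    terms = list(tokens)
    for n in range(2, max_ngram + 1):
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms


print(preprocess("The graphics card is in the PC"))
```

The resulting term lists can then be turned into word vectors and compared, which is the role ExampleSet2Similarity plays at the end of the process.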

Cheers,
Ingo