Topic: How to compare similarity of large number of documents
etharpe
Newbie
*
Posts: 3


« on: December 23, 2011, 06:07:35 PM »

Hello,

I'm looking for a way to compute the pairwise similarity of a large number of documents, i.e., the similarity of document A to B, A to C, B to C, etc. I have been using the Text Mining extension.

The process I have been using is:
Retrieve > Nominal to Text > Data to Documents > Process documents (TF_IDF) (+Tokenize) > Data to Similarity (CosineSimilarity)

The documents are short, under 30 words.
There are about 1200 documents.
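
For readers who want to sanity-check the pipeline outside RapidMiner, here is a minimal Python sketch of the same idea: tokenize on non-letters, build TF-IDF vectors, then take the cosine similarity. The function names are made up for illustration, and the TF-IDF formula is one common textbook variant, not necessarily the exact weighting the Text Mining extension uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Split on runs of non-letter characters (assumed to approximate the
    # "non letters" mode of the Tokenize operator).
    return [tok for tok in re.split(r"[^A-Za-z]+", text) if tok]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for this example.
docs = ["red apple pie", "red apple tart", "blue ocean wave"]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # positive: shared terms
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0: no shared terms
```

With documents under 30 words the vectors are very sparse, which is why the sketch stores only the terms that actually occur in each document.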

This works for a small number of documents, normally in 2-3 seconds. However, when I run it on all 1200 documents, RapidMiner reports that the process completed in 0 seconds and then shows no results. The status bar at the bottom right stays frozen on "Creating Displays" and the program stops working.

Does this happen because there are too many results for the operation? If so, what is the correct approach?

Help would be very much appreciated.

This is the full process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="521" width="748">
      <operator activated="true" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="//Repository1/Martyrs/Data/document similarity test data"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="120">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="C"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.1.004" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="210">
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="246" y="300">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prunde_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="5.0"/>
        <parameter key="prune_above_rank" value="5.0"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="parallelize_vector_creation" value="false"/>
        <process expanded="true" height="610" width="980">
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="181" y="42">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="data_to_similarity" compatibility="5.1.014" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="435">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="divergence" value="GeneralizedIDivergence"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
      <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

etharpe
Newbie
*
Posts: 3


« Reply #1 on: January 04, 2012, 05:08:30 PM »

Any ideas on this, anyone? I imagine the solution would be to create some kind of loop:
  First, RapidMiner compiles a list of the tokens in all the documents.
  Second, based on that list, RapidMiner computes the similarity of document A to document B, then C, then D, ...
  Third, RapidMiner computes the similarity of document B to document C, then D, ...
  Fourth, RapidMiner computes the similarity of document C to document D, then E, ...

Problem is, I have no idea how to do this!
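
For illustration only, the loop described above could be sketched in Python roughly as follows. This is not a RapidMiner process. It assumes each document has already been turned into a dense numeric vector over the shared token list, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pairwise_similarities(vectors):
    """Yield (i, j, similarity) for every unordered pair with i < j.

    The order is A-B, A-C, ..., then B-C, B-D, ..., as in the steps above.
    """
    for i, j in combinations(range(len(vectors)), 2):
        yield i, j, cosine(vectors[i], vectors[j])


# Three toy vectors, invented for this example.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for i, j, sim in pairwise_similarities(vectors):
    print(i, j, round(sim, 4))
```

Because the function is a generator, the pairs can be written out one at a time instead of being held in memory all at once.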

Eagerly awaiting your thoughts, and thank you.
awchisholm
Full Member
***
Posts: 157


« Reply #2 on: January 05, 2012, 02:11:43 PM »

Hello

If you input 1200 examples to the Data to Similarity operator you will get 1200*1199 pairs, roughly 1.4 million rows, so you're probably running into memory issues. My suggestion is to use the Similarity to Data operator to turn the similarity result back into an example set and see if this displays more efficiently. If not, I would write the result to the repository, a database or a file, and disconnect the result from the output so that it does not display at all.
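
The arithmetic can be checked with a few lines of Python. The helper name is made up for this example. It also shows the count if each pair is stored only once, on the assumption that the measure is symmetric, as cosine similarity is.

```python
def pair_counts(n_docs):
    """Return (ordered, unordered) pair counts, excluding self-pairs."""
    ordered = n_docs * (n_docs - 1)   # A-B and B-A counted separately
    unordered = ordered // 2          # each pair counted once
    return ordered, unordered


ordered, unordered = pair_counts(1200)
print(ordered)    # 1438800
print(unordered)  # 719400
```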

You can then read the result later and use the filter or sample operators to extract the bits you're interested in.
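
If the result was written to a CSV file, the "read later and filter" step could also be done outside RapidMiner. A minimal sketch, assuming the file has a header row with the columns first, second and similarity (the column names and the function name are assumptions for this example):

```python
import csv
import io


def read_similar_pairs(source, threshold):
    """Stream (first, second, similarity) rows at or above the threshold.

    `source` is any open text file or file-like object containing CSV data.
    """
    for row in csv.DictReader(source):
        similarity = float(row["similarity"])
        if similarity >= threshold:
            yield row["first"], row["second"], similarity


# In-memory stand-in for a stored result file; the values are invented.
sample = io.StringIO(
    "first,second,similarity\n"
    "1,2,0.91\n"
    "1,3,0.10\n"
    "2,3,0.55\n"
)
for pair in read_similar_pairs(sample, threshold=0.5):
    print(pair)
```

Reading row by row keeps memory use flat even with over a million pairs, since only the rows that pass the filter are kept.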

regards

Andrew
etharpe
Newbie
*
Posts: 3


« Reply #3 on: January 09, 2012, 10:29:31 AM »

Yes, that's done it. Thank you very much.
iinnaanncc
Newbie
*
Posts: 3


« Reply #4 on: February 09, 2012, 01:52:59 PM »

Dear Andrew,

I am able to get similarity results (with three columns: first, second, similarity) in RapidMiner when the number of rows is small. But when I try to get a larger number of rows in the similarity result, I hit the same problem: it says "Creating Displays" and waits forever.

Following your suggestion, I want to store the similarity results in an Excel file or in a database. But if I add a Write Excel operator, for example, it does not accept a similarity object as input. How can I export these similarity results to an Excel file?
awchisholm
Full Member
***
Posts: 157


« Reply #5 on: February 09, 2012, 11:54:55 PM »

Hello

Use the "Similarity to Data" operator to convert the similarity result to an example set, which the write operators will then accept.
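
Conceptually, the conversion flattens a similarity matrix into one row per pair, which is a shape that any table writer accepts. A rough Python sketch of that idea follows. It is not the operator's actual implementation, and the function names, the column names and the choice to emit both orderings of each pair are assumptions.

```python
import csv
import io


def similarity_to_rows(ids, matrix):
    """Yield (first, second, similarity) for every ordered pair, no self-pairs."""
    for i, first in enumerate(ids):
        for j, second in enumerate(ids):
            if i != j:
                yield first, second, matrix[i][j]


def write_rows(rows, target):
    """Write the rows as CSV with a header line to an open text file."""
    writer = csv.writer(target)
    writer.writerow(["first", "second", "similarity"])
    writer.writerows(rows)


# Toy 3x3 similarity matrix, invented for this example.
ids = ["A", "B", "C"]
matrix = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.25],
    [0.0, 0.25, 1.0],
]
buffer = io.StringIO()  # stand-in for open("similarities.csv", "w", newline="")
write_rows(similarity_to_rows(ids, matrix), buffer)
print(buffer.getvalue())
```

A CSV file written this way opens directly in Excel.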

regards

Andrew
iinnaanncc
Newbie
*
Posts: 3


« Reply #6 on: February 17, 2012, 12:59:33 PM »

Thanks!