Author Topic: Text Processing - How to track exactly which documents contain a word?  (Read 261 times)
Tan Koon Chin
Newbie
*
Posts: 4


« on: August 25, 2014, 02:30:35 AM »

Hi all,

I have run the text mining operators and obtained the ExampleSet (WordList to Data) and the WordList (Process Documents from Files). The result also shows the number of occurrences of each word. How can I determine which documents the words in the result belong to?

Example: The word "apple" appears 100 times in 80 documents. How can I track down exactly which documents contain the word "apple"? What am I missing here? Is there a solution?


Thanks in advance.

Regards.
awchisholm
Sr. Member
****
Posts: 398


« Reply #1 on: August 25, 2014, 12:56:53 PM »

Hello

Take a look at the following process. The example set output contains a label identifying each document, and because the documents are processed with term occurrences as the vector creation method, you can see the word counts for each document.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="165">
        <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc1"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="255">
        <parameter key="text" value=" banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc2"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document (3)" width="90" x="112" y="390">
        <parameter key="text" value="apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon apple banana lemon&#10;peach &#10;strawberry&#10;raspberry&#10;apple&#10;cherry&#10;melon"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="doc3"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="130" name="Process Documents" width="90" x="380" y="165">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
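
For readers without RapidMiner at hand, here is a rough Python sketch of what the process above computes. It is not RapidMiner code: the splitting on non-letters only approximates the Tokenize operator, and the helper names `term_occurrences` and `documents_containing` are made up for illustration. The texts and labels are the ones from the three Create Document operators.

```python
import re
from collections import Counter

# Same texts and labels as the three Create Document operators above.
documents = {
    "doc1": "apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon",
    "doc2": " banana lemon\npeach \nstrawberry\nraspberry\n\ncherry\nmelon",
    "doc3": ("apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon "
             "apple banana lemon\npeach \nstrawberry\nraspberry\napple\ncherry\nmelon"),
}


def term_occurrences(text):
    """Count each token; tokens are runs of letters (an approximation of Tokenize)."""
    return Counter(re.findall(r"[A-Za-z]+", text))


# One row of counts per document, keyed by the document label.
occurrences = {label: term_occurrences(text) for label, text in documents.items()}


def documents_containing(word, occurrences):
    """Return {label: count} for every document in which the word occurs."""
    return {label: counts[word]
            for label, counts in occurrences.items()
            if counts[word] > 0}


apple_docs = documents_containing("apple", occurrences)
print(apple_docs)  # {'doc1': 2, 'doc3': 4}
```

The point is that each row of the example set keeps its document label, so a word with a count above zero can be traced back to the documents it came from.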

regards

Andrew

Tan Koon Chin
« Reply #2 on: August 26, 2014, 02:58:43 AM »

Thank you for the reply.

What if many documents have been processed?
(With just a few documents, I could use the "Create Document" operator and label each of them.)

For example, the result of WordList shown is as below:

Word          Total Occurrence    In Documents
Apple         200                 180
Orange        150                 130
Strawberry    90                  50

The result reveals that "Apple" appears 200 times in 180 documents.

Is there any way to find out, from the analysis result, which documents those 180 are? (E.g. Doc. 10, Doc. 16, Doc. 45)

Regards,
Tan
awchisholm
« Reply #3 on: August 26, 2014, 08:11:28 PM »

If you are using the "Process Documents from Files" operator, the file name of each document will appear in the output example set if the option "add meta information" is set to true. The attribute name is metadata_file.
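
Once the file name is available as an attribute, finding the documents for a word is just a filter on that word's column. As a minimal sketch outside RapidMiner, assume the example set has been exported to CSV with a metadata_file column and one term-occurrence column per word. The file names and counts below are invented sample data, and `files_containing` is a hypothetical helper, not part of any RapidMiner API.

```python
import csv
import io

# Invented sample export: one row per document, one column per word.
SAMPLE_CSV = """metadata_file,Apple,Orange,Strawberry
doc10.txt,2,0,1
doc16.txt,1,3,0
doc17.txt,0,1,0
doc45.txt,1,0,0
"""


def files_containing(csv_text, word):
    """Return the metadata_file value of every row whose count for `word` is above zero."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["metadata_file"] for row in reader if float(row[word]) > 0]


print(files_containing(SAMPLE_CSV, "Apple"))  # ['doc10.txt', 'doc16.txt', 'doc45.txt']
```

The same idea applies inside RapidMiner: filter the example set on the word's attribute being greater than zero and read off metadata_file.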

Andrew

Tan Koon Chin
« Reply #4 on: August 29, 2014, 01:32:41 AM »

Thanks, Andrew, for the solution!

Best Regards.