Author Topic: [SOLVED] Help Needed in tokenization  (Read 350 times)
gunjanamit
Newbie
*
Posts: 28


« on: May 27, 2012, 08:19:42 AM »

Hi,

I have an Excel file with comments written in each cell of column A.
I want to extract all the words and compute the word frequencies.

My process XML is:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.001" expanded="true" name="Process">
    <process expanded="true" height="251" width="413">
      <operator activated="true" class="read_excel" compatibility="5.2.001" expanded="true" height="60" name="Read Excel" width="90" x="59" y="101">
        <parameter key="excel_file" value="C:\Users\guagg\Desktop\HP\samp.xls"/>
        <parameter key="imported_cell_range" value="A1:A30"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="120">
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="9999"/>
        <list key="specify_weights"/>
        <process expanded="true" height="414" width="762">
          <operator activated="true" class="text:tokenize" compatibility="5.2.002" expanded="true" height="60" name="Tokenize" width="90" x="124" y="28">
            <parameter key="characters" value=" "/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


But it's not working. Can someone help, please?
« Last Edit: May 30, 2012, 12:23:29 PM by Marius »
Rene
Newbie
*
Posts: 15


« Reply #1 on: May 30, 2012, 01:51:27 AM »

Under "Problems", RapidMiner tells you that "[t]he example set must contain at least one text attribute."
So ...
  • adding a "Nominal to Text" operator between "Read Excel" and "Process Documents from Data"
    and selecting the corresponding column, OR
  • declaring the desired column as type=text instead
    of type=polynominal directly in the import step

... might work.
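
For reference, the first option could look roughly like the fragment below inside the top-level <process> element. This is only a sketch: the operator key "nominal_to_text" and the port names "example set input" / "example set output" are assumed from RapidMiner 5.x and may differ in your version. No attribute filter is set, on the assumption that the default converts all nominal attributes, which is enough here because only column A is imported.

```xml
<!-- Sketch only: operator key and port names are assumptions based on RapidMiner 5.x -->
<operator activated="true" class="nominal_to_text" compatibility="5.2.001" expanded="true" name="Nominal to Text"/>
<!-- These two connections replace the direct connection from Read Excel to Process Documents from Data -->
<connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
```

The easiest way to get the exact XML for your installation is to drag the operator into the process in the GUI and then look at the XML view.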

Here's a demo video: http://www.youtube.com/watch?v=3Kntxr16EwE
gunjanamit
« Reply #2 on: May 30, 2012, 06:58:03 AM »

Awesome, Rene!

It works.

Thanks a lot for your help!

Could you please share your email ID with me?