Topic: How to create new examples by splitting at punctuation marks?
chrisniem
« on: July 20, 2012, 09:04:56 AM »

Hi all!

I wonder whether it is possible to split an example containing text at punctuation marks. I have an example set containing some metadata for a text attribute. The text attribute contains many sentences. Here are two examples for demonstration:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s economy. With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return. Others have been cutting their corn early to use for feed, a much less profitable venture."

What I want to do is to split the text attribute by e.g. "." while keeping the metadata for every sentence. The result would be 4 examples:

2012-05-04          Source1          Speaker1          Context1          "The unsettling prospects come at a time of growing uncertainty for the country’s economy."
2012-05-04          Source1          Speaker1          Context1          "With evidence mounting of a slowdown in the economic recovery, this new blow from the weather is particularly ill-timed."
2012-05-06          Source2          Speaker2          Context2          "Already some farmers are watching their cash crops burn to the point of no return."
2012-05-06          Source2          Speaker2          Context2          "Others have been cutting their corn early to use for feed, a much less profitable venture."

Is there any way to do this? I tried to use tokenization, but it delivers only vectors (i.e. new attributes), not new examples. If I switch off vectorization, I cannot see any difference in the result set apart from the "." being deleted from the text attribute.

Any help is much appreciated!

Thanks

Chris
Marius
« Reply #1 on: July 20, 2012, 09:52:10 AM »

Hi Chris,

You can use, e.g., the Cut Document operator for this. You may have to tune the regular expression a bit, but the process below depicts the general idea.

Best,
  ~Marius

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="505" width="721">
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="120">
        <list key="attribute_values">
          <parameter key="meta" value="false"/>
          <parameter key="text" value="&quot;This is also a test. With two sentences.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_data_user_specification" compatibility="5.2.008" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="45" y="30">
        <list key="attribute_values">
          <parameter key="meta" value="true"/>
          <parameter key="text" value="&quot;Test. Sentence. Blubb.&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="append" compatibility="5.2.008" expanded="true" height="94" name="Append" width="90" x="179" y="30"/>
      <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true" height="505" width="658">
          <operator activated="true" class="text:cut_document" compatibility="5.2.004" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries">
              <parameter key="t" value="\..\."/>
            </list>
            <list key="regular_expression_queries">
              <parameter key="t" value="([^\.]+)"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true" height="523" width="658">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|meta|text"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
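
For readers who want to experiment with the regular expression outside RapidMiner, here is a minimal sketch in Python. It only mimics the idea of the process above (one new row per matched group, with the metadata repeated); it is not RapidMiner code, and whether Cut Document handles the capturing group in exactly the same way is an assumption. `ORIGINAL` is the pattern from the process, while `TUNED` and the helper `split_rows` are hypothetical names introduced for illustration only.

```python
import re

# Pattern copied from the regular_expression_queries entry in the process above.
ORIGINAL = re.compile(r"([^\.]+)")

# Hypothetical tuned variant (not from the thread): skips leading whitespace
# and keeps the terminating ".", "!" or "?" with each sentence.
TUNED = re.compile(r"\s*([^.!?]+[.!?])")


def split_rows(rows, pattern):
    """Yield one (meta, sentence) pair per match, repeating the metadata."""
    for meta, text in rows:
        for match in pattern.finditer(text):
            yield meta, match.group(1)


# Same demo data as the two Generate Data operators, in Append order.
rows = [
    ("true", "Test. Sentence. Blubb."),
    ("false", "This is also a test. With two sentences."),
]

if __name__ == "__main__":
    for meta, sentence in split_rows(rows, TUNED):
        print(meta, repr(sentence))
```

In Python, the original pattern yields the sentences without the period and with the leading space of every sentence after the first, which is the kind of detail the "tune the regular expression" remark refers to. The tuned variant is one possible adjustment, not the only one; abbreviations such as "e.g." would still be split.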
chrisniem
« Reply #2 on: July 20, 2012, 03:10:20 PM »

Hi Marius,

great, that will do it!

Thanks a lot!

Chris