Topic: Removing HTTP Headers
hawle087
« on: April 21, 2012, 02:40:36 PM »

I'm trying to do some text analytics on a set of pre-downloaded HTML files, but unfortunately they also include the HTTP headers (e.g. Content-type: text/html). I've tried using Remove Document Parts with regular expressions to strip out the headers before passing the document to Extract Content, but for some reason the Extract Content operator ignores the removals.

To test this, I set up a simple process that takes as input a text file containing the words "one two three". Remove Document Parts removes the word "one" (checked via breakpoint), but the final output still includes it.

Can anyone help me understand why Extract Content is ignoring the prior removal, or suggest workarounds or alternative methods of removing HTTP headers from files?

Thanks.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="460" width="899">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="test" value="C:\Users\XXX\test_files"/>
        </list>
        <process expanded="true" height="460" width="899">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
            <parameter key="deletion_regex" value="one"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
            <parameter key="minimum_text_block_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="RM One" to_port="document"/>
          <connect from_op="RM One" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Update:

As a workaround, I used Replace Tokens after the Extract Content operator, though this is less than ideal for pattern matching.
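
Another possible workaround, outside RapidMiner, is to strip the header block from each file before loading it. The sketch below is only an illustration under stated assumptions: the function name `strip_http_headers` is hypothetical, and it assumes each file starts with an optional status line, then one or more "Name: value" lines, then a blank line before the HTML. It has not been tested against the files discussed in this thread, and it would also strip a document that happens to begin with "Name: value" lines followed by a blank line.

```python
import re

# Assumed layout of the saved files: an optional status line
# (e.g. "HTTP/1.1 200 OK"), one or more "Name: value" header lines,
# then a blank line that separates the headers from the HTML body.
HEADER_BLOCK = re.compile(
    r"\A(?:HTTP/\d(?:\.\d)? \d{3}[^\r\n]*\r?\n)?"  # optional status line
    r"(?:[A-Za-z0-9-]+:[^\r\n]*\r?\n)+"            # header lines
    r"\r?\n"                                       # blank separator line
)


def strip_http_headers(text: str) -> str:
    """Return text without a leading HTTP header block, if one is present.

    Text that does not start with a header block is returned unchanged.
    """
    return HEADER_BLOCK.sub("", text, count=1)


if __name__ == "__main__":
    raw = (
        "Content-type: text/html\r\n"
        "Content-Length: 42\r\n"
        "\r\n"
        "<html><body>one two three</body></html>"
    )
    print(strip_http_headers(raw))
    # prints: <html><body>one two three</body></html>
```

Running this over a copy of the input directory would let Process Documents from Files read header-free HTML, so no removal step is needed before Extract Content.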
« Last Edit: April 22, 2012, 02:10:23 PM by hawle087 »
Nils
« Reply #1 on: April 23, 2012, 08:59:47 AM »

Hi,

If you place a 'Combine Documents' operator after 'Remove Document Parts', it works for me.

Code:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="341" width="413">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="313" y="75">
        <list key="text_directories">
          <parameter key="test" value="C:\Users\XXX\test"/>
        </list>
        <process expanded="true" height="461" width="889">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="179" y="30">
            <parameter key="deletion_regex" value="one"/>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="5.2.001" expanded="true" height="76" name="Combine Documents" width="90" x="313" y="30"/>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="514" y="30">
            <parameter key="minimum_text_block_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="RM One" to_port="document"/>
          <connect from_op="RM One" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Combine Documents" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

Best,
Nils