Author Topic: Remove URL from document  (Read 212 times)
fokko
Newbie
*
Posts: 2


« on: September 17, 2014, 06:21:28 PM »

Hello,
I have a problem with my text preprocessing. Maybe someone can help me :)

My text looks like this:

T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.
Commodities
Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.
Europe
European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.
[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:



I want to remove the URLs from the text. How can I do this? I think Filter Tokens does not work. Is Remove Document Parts the solution?

I think the solution should be a rule like this: if a word starts with http or www., delete that word from the text (but only the URLs, nothing else).



Kind regards
homburg
Administrator
Jr. Member
*****
Posts: 78


« Reply #1 on: September 18, 2014, 04:39:40 PM »

Hi fokko,

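Depending on your setup, you can use "Replace" (for example sets) or "Replace Tokens" (for tokenized documents) with a regex like this: \[\d*\][^\[\]]* to match all the URL links in your text input. It matches each numbered reference marker such as [1] together with everything that follows it up to the next square bracket, which here is the URL.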

Cheers,
Helge
fokko
Newbie
*
Posts: 2


« Reply #2 on: September 22, 2014, 05:19:32 PM »

Thanks for your response, but I still can't solve my problem. I don't understand the regex. If I want to delete the words that begin with http from the text, what is the regex, and how do I configure the operator?

My setup consists only of Process Documents from Files, after which I tried Replace (for example sets).

I don't tokenize in my setup. (If I tokenized a URL like www.helpme.com, I would get the separate tokens www, helpme, and com. So if I searched for www, I could not delete the complete URL.)

Thank you for any comments.
homburg
Administrator
Jr. Member
*****
Posts: 78


« Reply #3 on: September 22, 2014, 08:54:10 PM »

Hi!

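You don't need to tokenize. Please have a look at this process, which shows both Replace Tokens and Remove Document Parts applied to your text: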

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Load Text" width="90" x="112" y="30">
        <parameter key="text" value="T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.&#10;Commodities&#10;Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.&#10;Europe&#10;European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.&#10;[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:"/>
      </operator>
      <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens" width="90" x="447" y="30">
        <list key="replace_dictionary">
          <parameter key="\[\d*\][^\[\]]*" value="!!REPLACED!! "/>
        </list>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Load Text (2)" width="90" x="112" y="120">
        <parameter key="text" value="T-Mobile US Inc. and two regional carriers, General Communication Inc. in Alaska and CT Cube LP in Texas. The order is subject to review by President Barack Obama.&#10;Commodities&#10;Oil futures rose 67 cents to $93.98 a barrel as U.S. crude supplies dropped, while gold for August delivery climbed $8 to $1,405 an ounce.&#10;Europe&#10;European markets finished sharply lower today with shares in London leading the region. The FTSE 100 was down 2.12% while France's CAC 40 was off 1.87% and Germany's DAX fell lower by 1.20%.&#10;[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [2]: http://www.proactiveinvestors.comcompanies/overview/2245/salesforcecom--2245.html [3]: http://www.proactiveinvestors.com/companies/overview/2397/Goldman+Sachs [4]: http://www.proactiveinvestors.comcompanies/overview/3787/general-motors-company--3787.html [5]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [6]: http://www.proactiveinvestors.comcompanies/overview/1189/dell-1189.html [7]: http://www.proactiveinvestors.com/companies/overview/1189/Dell [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [9]: http://www.proactiveinvestors.comcompanies/overview/2306/apple-2306.html [10]: http://www.proactiveinvestors.com/companies/overview/4450/Samsung+Electronics [11]: http://www.proactiveinvestors.com/companies/overview/2306/Apple [12]:"/>
      </operator>
      <operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts" width="90" x="447" y="120">
        <parameter key="deletion_regex" value="\[\d*\][^\[\]]*"/>
      </operator>
      <connect from_op="Load Text" from_port="output" to_op="Replace Tokens" to_port="document"/>
      <connect from_op="Replace Tokens" from_port="document" to_port="result 1"/>
      <connect from_op="Load Text (2)" from_port="output" to_op="Remove Document Parts" to_port="document"/>
      <connect from_op="Remove Document Parts" from_port="document" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
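
For the rule asked about earlier in the thread (delete anything that starts with http or www.), a minimal variation of the process above might look like the sketch below. It reuses the Remove Document Parts operator and its deletion_regex parameter exactly as they appear in the process above. The regex (https?://|www\.)\S+ is an assumption and not part of the original reply: it is meant to match any run of non-whitespace characters beginning with http://, https:// or www. The sample text is shortened from the original post.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.008" expanded="true" name="Process">
    <process expanded="true">
      <!-- Shortened sample text taken from the original post -->
      <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Load Text" width="90" x="112" y="30">
        <parameter key="text" value="Europe&#10;[1]: http://www.proactiveinvestors.com/companies/overview/2245/Salesforce.com [8]: http://www.proactiveinvestors.com/companies/overview/2306/Apple"/>
      </operator>
      <!-- Assumed regex: a whitespace-free run starting with http://, https:// or www. -->
      <operator activated="true" class="text:remove_document_parts" compatibility="5.3.002" expanded="true" height="60" name="Remove Document Parts" width="90" x="447" y="30">
        <parameter key="deletion_regex" value="(https?://|www\.)\S+"/>
      </operator>
      <connect from_op="Load Text" from_port="output" to_op="Remove Document Parts" to_port="document"/>
      <connect from_op="Remove Document Parts" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
```

Unlike the \[\d*\][^\[\]]* pattern, this variant would leave the numbered markers such as [1]: in the text, so it probably suits documents where URLs appear inline rather than in a reference list. Because the deletion runs on the untokenized document, each URL is removed as a whole.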

Cheers,
Helge