Author Topic: web research  (Read 1716 times)
alphabeto
Newbie
*
Posts: 8


« on: October 01, 2013, 11:13:07 AM »

Hi,
Can RapidMiner run an automated, recurring search (say, daily) for a list of words across a list of URLs and return the link of each matching page?
I have a list of words, and I want to regularly collect every web link where any of these words appears on any of the sites in my predefined URL list.


E.g. word list: qwe, rty
URL list: www.asd.com, www.zxc.com

What process setup would give me, daily and automatically, each web link where the words "qwe" and/or "rty" appear on www.asd.com and/or www.zxc.com?


Many thanks
Dan
Marius
Administrator
Hero Member
*****
Posts: 1794



« Reply #1 on: October 02, 2013, 01:46:17 PM »

Hi Dan,

you can use the Get Pages operator to retrieve the contents of a number of web pages whose links you provide in a data table.
You can then use the Text Processing extension to count the words that appear on the different pages. Our website provides some links to video tutorials for the text mining extension: http://rapid-i.com/content/view/189/212/lang,en/
To focus on the contents of the pages and remove all HTML tags, you can use the Extract Content operator.
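
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the "strip the HTML, then count the target words" step. It is not what Extract Content does internally (that is an assumption-free statement only about this sketch: it simply drops tags plus script/style bodies), and the names TextExtractor, html_to_text and count_words are made up for illustration. Fetching the pages is left out on purpose.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def count_words(html, words):
    """Count case-insensitive whole-word occurrences of each target word."""
    tokens = re.findall(r"\w+", html_to_text(html).lower())
    counts = Counter(tokens)
    return {word: counts[word.lower()] for word in words}


# Hypothetical page content, using the example words from the first post.
SAMPLE_PAGE = (
    "<html><head><style>qwe{}</style></head>"
    "<body><p>qwe and QWE</p><a href='x'>rty</a></body></html>"
)
SAMPLE_COUNTS = count_words(SAMPLE_PAGE, ["qwe", "rty"])
```

Running count_words once per downloaded page gives one row of word counts per URL, which is the same shape of table the text processing step produces.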

Finally, to execute the job regularly, you should use the RapidAnalytics server, also available on our website.

Best regards,
Marius

Please add [SOLVED] to the topic title when your problem has been solved! (do so by editing the first post in the thread and modifying the title)
alphabeto
« Reply #2 on: October 03, 2013, 10:39:06 AM »

Hi Marius,

Thank you. I'm almost there. But to finish the job, after I extract the words with "Extract Content" as you suggest, I still need to get, for every extracted word, a document list or a folder with the matching pages (the URL links in a document, or the HTML pages in a folder, etc.). How can I do this?

Thanks,
Dan
alphabeto
« Reply #3 on: October 03, 2013, 10:47:45 AM »

In other words, my job is to filter a predefined list of sites (the filter being a list of various words), AND THE RESULT must be the specific WEB LINKS to the pages on those predefined sites where the words appear.
Marius
« Reply #4 on: October 10, 2013, 07:54:07 AM »

Hi Dan,

after the Process Documents operator you should have a table that contains the occurrences of each word (columns) in each document (rows), along with the URL of the page in the URL attribute.

Now you can iterate over your target words and use Filter Examples to keep only those rows where the column for the current word contains a value greater than zero. Then you can write the URLs of the matching documents to the hard disk, e.g. with the Write Excel or Write CSV operator.
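
To make the filtering step concrete, here is a rough Python sketch of the same logic, not a RapidMiner process. It assumes the word table is already available as a list of rows with a "URL" column and one count column per word; the function names and the sample URLs are hypothetical.

```python
import csv
import io


def urls_per_word(rows, words, url_column="URL"):
    """For each target word, keep the URLs of rows whose count is above zero."""
    return {
        word: [row[url_column] for row in rows if row.get(word, 0) > 0]
        for word in words
    }


def write_matches(matches, handle):
    """Write one (word, url) line per match to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["word", "url"])
    for word, urls in matches.items():
        for url in urls:
            writer.writerow([word, url])


# Hypothetical word table: one row per page, one count column per word.
ROWS = [
    {"URL": "http://www.asd.com/a", "qwe": 2, "rty": 0},
    {"URL": "http://www.zxc.com/b", "qwe": 0, "rty": 1},
    {"URL": "http://www.zxc.com/c", "qwe": 1, "rty": 3},
]
MATCHES = urls_per_word(ROWS, ["qwe", "rty"])

# Written to an in-memory buffer here; an opened file works the same way.
BUFFER = io.StringIO()
write_matches(MATCHES, BUFFER)
```

In RapidMiner terms, the dictionary comprehension plays the role of looping over the words with Filter Examples, and write_matches stands in for Write CSV.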

Does that help? If you have any questions left, please attach the XML of your process so that we can use it as a basis for our answer.

Best regards,
Marius
alphabeto
« Reply #5 on: November 22, 2013, 09:24:41 AM »

Hello Marius,

I sent you the XML of my process by email, as you mentioned. Can I count on your answer to my email about making the process work from start to finish?

Many thanks!
Marius
« Reply #6 on: November 22, 2013, 10:54:02 AM »

Hi Dan,

please post your process publicly to this thread - it may also be interesting for other users.

Best regards,
Marius
alphabeto
« Reply #7 on: December 02, 2013, 02:00:43 PM »


Hi Marius,

Below is the process, as far as I could get. Can I count on you to make it work and finalize this job (that is, to finally get the URL list of the pages where the searched words appear on the predefined list of websites)?

Thanks again and hoping for the best,
Dan

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="5.3.013" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
        <parameter key="excel_file" value="C:\xxx\Links.xls"/>
        <parameter key="imported_cell_range" value="A1:B6"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="links.true.file_path.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="5.3.001" expanded="true" height="60" name="Get Pages" width="90" x="179" y="30">
        <parameter key="link_attribute" value="links"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="30">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="eurpoa" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.013" expanded="true" height="76" name="Write Excel" width="90" x="380" y="210"/>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="380" y="210">
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="246" y="30">
            <parameter key="string" value="europa"/>
          </operator>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Filter Documents (by Content)" to_port="documents 1"/>
          <connect from_op="Filter Documents (by Content)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
alphabeto
« Reply #8 on: December 11, 2013, 01:37:14 PM »

Hello, I really need to know whether I can count on support here on this matter.

Best regards and thanks again,
 
Marco Boeck
Administrator
Hero Member
*****
Posts: 953


« Reply #9 on: December 11, 2013, 02:13:16 PM »

Hi,

just a friendly reminder: this is a community forum where members of the community can help each other out. Sometimes, when time allows, we do chip in and answer some questions. However, there is never a guarantee that we will answer in this forum. If you need support with guaranteed response times, please contact us and inquire about enterprise support.

Regards,
Marco

alphabeto
« Reply #10 on: December 11, 2013, 03:14:58 PM »

OK, I am sorry if I was somewhat pushy or asked too much.
However, if anyone has an idea for this, it would be a great help, as I need it to finish some work.
Regards and a very nice day!