Author Topic: [SOLVED] Web Mining: Crawl Web works or not - depending on site, bug or feature?
number6
Newbie
Posts: 2
« on: September 19, 2013, 12:20:11 PM »

Are there any known bugs in the Web Mining: Crawl Web operator? I have noticed several forum threads on the web asking the same question - but no answers.

I have now tested RapidMiner version 5.3.013 with the latest Web Mining extension. The two sites are shown in the code below; the same logic is used for both, yet one works and one does not.

1. This works:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://uta.fi"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*tutkimus.*"/>
          <parameter key="follow_link_with_matching_url" value=".*tutkimus.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


2. But this does not, although the logic is essentially the same:
Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
          <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
        </list>
        <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="really_ignore_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I wonder why. Also, is there any way to see a bit more detail - step by step, what the operator is doing while parsing a page - so that I could find out the reason myself?

Is the RapidMiner "Crawl Web" operator generally reliable, or should I rather use some other software for crawling pretty big forum sites - and then use RapidMiner only for mining the crawled files?
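I also noticed that the root Process operator has a logverbosity parameter; would raising it show these details? A minimal sketch, assuming "all" is an accepted value for that parameter:

Code:
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
  <!-- Assumption: "all" is a valid logverbosity value; higher verbosity should print step-by-step status messages to the Log view -->
  <parameter key="logverbosity" value="all"/>
  <process expanded="true">
    <!-- Crawl Web and the rest of the process go here, unchanged -->
  </process>
</operator>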

« Last Edit: October 10, 2013, 08:24:47 AM by Marius »
pjdoubleyou
Newbie
Posts: 5

« Reply #1 on: September 22, 2013, 09:37:02 PM »

I had a similar issue where RM would crawl some sites but not others. I bumped the max page size up to 1000 KB and now it works very well.
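In the process XML, the change would look roughly like this - a sketch, assuming the parameter key is max_page_size and the value is given in KB:

Code:
<operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
  <parameter key="url" value="http://kaksplus.fi/keskustelu/plussalaiset/mitas-nyt"/>
  <list key="crawling_rules">
    <parameter key="store_with_matching_url" value=".*keskustelu.*"/>
    <parameter key="follow_link_with_matching_url" value=".*keskustelu.*"/>
  </list>
  <!-- Assumption: the key is max_page_size and the value is in KB; raising the limit keeps large pages (like a forum index) from being skipped -->
  <parameter key="max_page_size" value="1000"/>
  <parameter key="output_dir" value="C:\Users\Administrator\Desktop\Huoltamo\DataMining\crawlwebtest"/>
  <parameter key="extension" value="html"/>
  <parameter key="max_pages" value="100"/>
  <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36"/>
  <parameter key="obey_robot_exclusion" value="false"/>
  <parameter key="really_ignore_exclusion" value="true"/>
</operator>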

number6
Newbie
Posts: 2
« Reply #2 on: October 03, 2013, 10:36:43 AM »

pjdoubleyou, thank you very much - it helped! The source URL's page was indeed over 100 KB, although the fetched pages were smaller.