Author Topic: WEB crawler rules  (Read 1651 times)
keops9876
« on: June 25, 2013, 05:05:27 PM »

Hi!

I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm a complete beginner in RapidMiner.
So I've started to use the Web Crawler operator.

I use it to crawl a Slovenian real estate website, and I have trouble setting the web crawler rules.

I know there are two important rules: which URLs to follow and which to store.

I would like to store URLs of the form "http://www.realestate-slovenia.info/nepremicnine.html" + id=something;
for example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280

What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+

Any help would be appreciated!

U.
Marius
« Reply #1 on: June 26, 2013, 10:39:37 AM »

Hey U,

on a quick check I got some pages with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+

And I also increased the max page size to 10000.
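
In the Crawl Web operator's XML, those settings correspond roughly to the following (just a sketch of the relevant parameters, not a complete process):

<parameter key="url" value="http://www.realestate-slovenia.info/"/>
<list key="crawling_rules">
  <parameter key="follow_link_with_matching_url" value=".+id.+"/>
  <parameter key="store_with_matching_url" value=".+id.+"/>
</list>
<parameter key="max_page_size" value="10000"/>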

As always I have to ask: did you check that the site's policy/copyright notice allows you to crawl it automatically?

Best regards,
Marius
keops9876
« Reply #2 on: June 26, 2013, 07:37:45 PM »

Marius,

the web page allows robots.

Your example stores only the real estate ads on the first page. The web crawler doesn't go on to the second, third, ... page.

Thanks for helping.
« Last Edit: June 26, 2013, 07:56:20 PM by keops9876 »
Marius
« Reply #3 on: June 27, 2013, 09:41:46 AM »

Then you probably have to increase the max_depth and adapt your rules. Please note that you should not add more than one follow rule; instead, put all expressions into one single rule, separated by a vertical bar, as you did in your first post.
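
For example, a single follow rule covering both the paging links and the ad links could look roughly like this (just a sketch; the exact expressions depend on the links on the page):

<list key="crawling_rules">
  <parameter key="follow_link_with_matching_url" value=".+pg.+|.+id.+"/>
  <parameter key="store_with_matching_url" value=".+id.+"/>
</list>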

Best regards,
Marius
keops9876
« Reply #4 on: July 24, 2013, 03:19:08 PM »

Marius,

I put the Web Crawler problem aside for a while; today I started to deal with it again. I still have a problem with the crawling rules. All the other web crawler parameters are clear.

This is my Web crawler process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&amp;]pg=.+ | id=.+)"/>
          <parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
        </list>
        <parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="max_page_size" value="10000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

As you can see, I try to follow three types of URLs, for example:

http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923

And I want to store only one type of URL:

http://www.realestate-slovenia.info/nepremicnine.html?id=5469846

So for the first task my rule is

http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)

For the second task the rule is:
http://www.nepremicnine.net/nepremicnine.html?id=.+

The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example
.+pg.+ | .+id.+ for the first task and .+id.+ for the second task, but the latter returns many pages that are not my focus.
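
One thing I'm not sure about: if the rules are matched as regular expressions, then the unescaped ? and the spaces around | in my rules would be treated as regex syntax and literal spaces rather than as plain URL text, so perhaps an escaped variant would behave differently, something like (untested):

follow: http://www\.realestate-slovenia\.info/nepremicnine\.html\?(q=sale|q=sale&pg=.+|id=.+)
store:  http://www\.realestate-slovenia\.info/nepremicnine\.html\?id=.+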

I would really like this process to work, because the gathered data are the basis for my article.

Thanks.
