Author Topic: Extract Data from HTML pages with loops  (Read 732 times)

« on: September 30, 2013, 02:18:48 AM »

Hi Team,

I have been a big fan of RapidMiner and I am trying to explore this tool more and more. Today, I wanted to scrape data from a review site. The task has two parts:
1. Download the pages from the site
2. Crawl through each page to extract the data

I am able to do the first part and download the pages from the URL:

Next, I want to capture the data from each review. I was able to obtain the XPath using Google Docs, exactly the way explained in

I want my process to loop not only through multiple files but also within each file, over its multiple reviews. One file has approximately 8 reviews, and I want to loop through this file as well as 7 other files, so 64 reviews in all. I am using "Process document from file" --> "Extract Information".
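To make the intended two-level loop concrete, here is a rough sketch of the same idea outside RapidMiner, in plain Python. It is only an illustration under stated assumptions: the pages are assumed to be well-formed XHTML (real pages often are not and would need a tolerant parser), each review is assumed to sit in a div with class "review comment ", and the function name extract_reviews and the three extracted fields are hypothetical.

```python
import glob
import os
import xml.etree.ElementTree as ET

# XHTML namespace bound to the "h" prefix, mirroring the h: prefix in the queries.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def extract_reviews(directory):
    """Return one dict per review found in any file directly under `directory`.

    Assumes every file is well-formed XHTML (an assumption, not a given).
    """
    rows = []
    # Outer loop: every downloaded page in the directory.
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        if not os.path.isfile(path):
            continue  # skip sub-directories
        root = ET.parse(path).getroot()
        # Inner loop: every review block inside the current page.
        for review in root.iterfind(".//h:div[@class='review comment ']", NS):
            rows.append({
                "file": os.path.basename(path),
                # These paths are relative to the current review, not the page.
                "User": review.findtext("h:div/h:h4/h:span", default="", namespaces=NS),
                "Pros": review.findtext("h:div/h:dl/h:dd[1]", default="", namespaces=NS),
                "Cons": review.findtext("h:div/h:dl/h:dd[2]", default="", namespaces=NS),
            })
    return rows
```

With 8 files of 8 reviews each, a call such as extract_reviews("pages") would be expected to return 64 rows, one per review, which is the result I am trying to get out of the RapidMiner process.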

Settings for - "Process document from file"

File from a list of directories, file pattern - *, use file extension, add metadata information

Settings for "Extract Information"

Query type - XPath, Attribute type - nominal, XPath queries as below, namespace - nothing, Ignore CDATA and Assume HTML - checked

However, when I use these queries in the tool, I am not able to configure the operator for some reason, and the process fails. Can anyone please advise me here?

Here are my XPath queries in the Extract Information operator:

1. //h:*[@class="review comment "]/h:div/h:h4/span (User)
2. //h:*[@class="review comment "]/h:div/h:h4/a (User_Type)
3. //h:*[@class="review comment "]/h:div/h:span/h:span/span[1] (Ratings)
4. //h:*[@class="review comment "]/h:div/h:dl/dd[1] (Pros)
5. //h:*[@class="review comment "]/h:div/h:dl/dd[2] (Cons)
6. //h:*[@class="review comment "]/h:div/h:p/span (Purchase_Date)
7. //h:*[@class="review comment "]/h:div/h:div/span (Review_Helpful)
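One thing that may be worth checking in the queries above is that the last step of each path (span, a, dd) has no h: prefix, while the earlier steps do. In XPath 1.0, an unprefixed element name matches only elements in no namespace. The Python sketch below illustrates the difference on a made-up XHTML snippet; the sample markup is hypothetical, and whether this is what makes the RapidMiner process fail is an assumption, not something confirmed here.

```python
import xml.etree.ElementTree as ET

# XHTML namespace bound to the "h" prefix, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical, simplified review markup; the real page structure may differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="review comment "><div><h4><span>alice</span><a>Verified buyer</a></h4></div></div>
<div class="review comment "><div><h4><span>bob</span><a>Guest</a></h4></div></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Last step has no prefix, so it looks for a <span> in no namespace.
unprefixed = root.findall(".//h:div[@class='review comment ']/h:div/h:h4/span", NS)

# Every step carries the h: prefix, so it looks for XHTML <span> elements.
prefixed = root.findall(".//h:div[@class='review comment ']/h:div/h:h4/h:span", NS)

print("matches without prefix on last step:", len(unprefixed))
print("matches with prefix on every step:", [e.text for e in prefixed])
```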