Topic: Downloading PDFs
Dazzerman
Newbie
*
Posts: 4


« on: July 09, 2013, 03:43:11 PM »

Hi, I'm using the "Crawl Web" process to download PDF documents on a Windows 7 Pro machine, using version 5.3.008 of RapidMiner.  Is there a way of getting RapidMiner to download the documents in question without modifying them?  The files I end up with are corrupted in at least two different ways.

When I have RapidMiner download a PDF document from a direct URL, opening the resulting file gives the following message:
"There was an error opening this document. The file is damaged and could not be repaired."

When I have it download a document that is accessed via a link such as ...download.php?id=..., I can open the resulting document, but it shows nothing but multiple empty pages.

Inspecting these two types of files in Notepad suggests that the latter is much closer to the correct format, which is somewhat ironic since the URL doesn't include the PDF file name in that case.
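
For anyone who wants something more systematic than eyeballing the files in Notepad, here is a minimal Python sketch of a sanity check. It only tests for the "%PDF-" header at the start of the file and the "%%EOF" marker near the end, so it can flag a badly damaged file but cannot prove that a file is intact. The function name looks_like_pdf and the 1024-byte tail window are my own choices for illustration.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file has a PDF header and an end-of-file marker.

    This is a rough check only: it does not validate the PDF contents.
    """
    data = Path(path).read_bytes()
    # A well-formed PDF normally begins with "%PDF-" followed by a version.
    has_header = data.startswith(b"%PDF-")
    # The "%%EOF" marker is expected close to the end of the file.
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer
```

A file that passes this check but still renders as blank pages would point towards damage inside the binary content rather than a truncated or mislabelled download.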

I have left the Encoding settings as the SYSTEM default, although I have tried one or two alternative settings to no avail.
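
One possible explanation, which is an assumption on my part and not something confirmed anywhere in this thread, is that the PDF bytes are being decoded as text and then written back out. The Python sketch below shows why that would matter: a decode/encode round trip through UTF-8 changes the non-ASCII bytes while leaving the ASCII parts alone, whereas a single-byte encoding such as Latin-1 happens to preserve every byte. The sample bytes are made up for illustration.

```python
# Hypothetical sample: a PDF header followed by a comment line containing
# high bytes, similar to what many PDF writers emit to mark a file as binary.
original = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def round_trip(data, encoding):
    """Decode bytes as text and encode them again, as a text pipeline would."""
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


utf8_result = round_trip(original, "utf-8")
latin1_result = round_trip(original, "latin-1")

print(utf8_result == original)          # False: the high bytes were replaced
print(latin1_result == original)        # True: Latin-1 maps every byte 1:1
print(utf8_result[:9] == original[:9])  # True: the ASCII header survives
```

If something like this is happening, it would fit a file that looks nearly right in a text editor but whose binary parts are damaged.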

Can anyone help?

Thanks!
Marius
Administrator
Hero Member
*****
Posts: 1753



« Reply #1 on: July 22, 2013, 09:49:01 AM »

What do you mean by downloading it directly? Do you mean from the browser? If so, the file is probably corrupted on the server, and RapidMiner has no chance of getting it right. If I misunderstood you, please let me know.

Best regards,
Marius
Dazzerman
Newbie
*
Posts: 4


« Reply #2 on: July 24, 2013, 11:29:21 AM »

Hi Marius, thanks for the reply.

Sorry for the confusion.  I simply meant that RapidMiner is trying to download a PDF via a direct URL, such as:
www.website.com/folder1/otherfolder/filename.pdf

Downloading the PDFs manually via the right-click options works fine.  I can also do it with a separate WGet application.  It is only when RapidMiner downloads the documents that I get the problems mentioned.
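
For comparison, this is roughly what a byte-for-byte download looks like outside RapidMiner. It is a minimal Python sketch using only the standard library, and the function name download_binary is made up for illustration. The point is that the response body is copied to disk in binary mode, with no text decoding at any step.

```python
import shutil
import urllib.request


def download_binary(url, dest_path):
    """Copy the response body to dest_path without any text decoding."""
    # The output file is opened in binary mode ("wb"), so the bytes reach
    # the disk exactly as they were received.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

Note that the URL passed in has to include the scheme (for example http://), which the example address above leaves out.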